Object detection and tracking allow the control unit of an autonomous vehicle to plan the proper driving actions, accounting for the surrounding environment. As a basic module of object tracking, object detection can be carried out with traditional handcrafted feature-based methods as well as with deep learning methods. Traditional methods, which leverage multiple low-level features, are composed of three steps: segmentation, hand-engineered feature extraction, and classification. In contrast, deep learning methods learn semantic, high-level, and deeper feature representations directly from data. These data-driven methods have achieved major breakthroughs in detection accuracy and timeliness.
Object detection is a hot research topic as far as deep learning based approaches are concerned. Due to the great variety of sensors used in automotive applications, deep learning methods have been developed for both 2D and 3D raw data. Regarding methods for 2D data, the most widely used architectures take images as input: they consist either of convolutional neural networks (CNNs) with region proposals, such as R-CNN, SPP-Net, Fast and Faster R-CNN [11, 27], or of single-shot detectors, such as YOLO and SSD [26, 17]. However, although these camera-based methods are widely used for object classification due to the high semantic content that images provide, they usually suffer from a lack of spatial information. This leads to lower accuracy in object position estimation, and also makes the process more sensitive to occlusions.
Compared with 2D methods, 3D methods take point-clouds as input data. A rotating or solid-state LiDAR sensor provides the 3D position, as well as the reflection intensity, of each point surrounding the vehicle. This representation is more accurate and spatially richer; thus, it can improve both object detection and classification, and even reduce the negative effects of occlusions. However, a 3D point-cloud generally needs to be re-arranged before it can be used as input to a 2D object detection method. This paper deals with the problem of using a LiDAR point-cloud in a 3D object detection method to ensure object detection for an autonomous vehicle.
This paper presents a comparison between two different projection-based methods that have been implemented within the state estimation routine of the prototype autonomous vehicle shown in Fig. 1. The first solution is based on the open-source Apollo FCNN-based object detection algorithm; the second is a geometric pipeline for 3D point-cloud processing developed in our labs. The experimental vehicle is equipped with a 16-beam LiDAR sensor. All the experimental tests have been carried out at the Monza ENI circuit. The ground truth for the validation and comparison of the methods is based on GPS measurements of the target object, corrected with an RTK service. Both algorithms are implemented on ROS: the deep learning based approach runs on an NVIDIA GTX 1050Ti, while the geometric one has been implemented to work on CPU only.
The paper is organized as follows. Section II illustrates the current state of the art in point-cloud based obstacle detection, with particular attention to deep learning based approaches. The FCNN-based object detection algorithm is presented in Section III, while our geometric-based method is described in detail in Section IV. The experimental setup is presented in Section V, and Section VI analyzes the results by comparing the position estimates of both solutions against a ground truth.
II Related Works
Projection-based methods implement a single- or multi-view projection of a 3D point-cloud onto a 2D grid, which is then processed by a 2D CNN or by a traditional pipeline to find object clusters with the desired confidence. Feature extraction is done during the projection of points onto a horizontal plane, discretized with a grid of pre-determined dimensions. In this way, different channels (e.g., height, density, intensity, occupancy) can be added to increase the number of features available within each grid cell. Next, these channels are stacked together and treated as a 2D image. Complex-YOLO, BirdNet and PIXOR map the point cloud into a Bird’s Eye View (BEV). LMNet and VeloFCN take the frontal view (FV) of the point cloud as input. MV3D adopts both the BEV and the FV of the point cloud as input. BEV is widely used due to its lower probability of occlusion w.r.t. FV. Projection-based methods reduce the dimensionality of the point cloud and the computational cost through projection, at the price of an inevitable loss of spatial information. These methods therefore trade accuracy for computational cost.
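The projection step described above can be illustrated with a minimal sketch. It assumes points given as (x, y, z, intensity) tuples; the function name `bev_features`, the grid extent and the cell size are hypothetical choices for illustration and are not taken from any of the cited methods.

```python
from collections import defaultdict


def bev_features(points, x_range=(-10.0, 10.0), y_range=(-10.0, 10.0), cell=0.5):
    """Project (x, y, z, intensity) points onto a horizontal grid.

    Returns a dict mapping (row, col) to the feature channels of that cell.
    Grid extent and cell size are illustrative assumptions.
    """
    cells = defaultdict(list)
    for x, y, z, intensity in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # discard points that fall outside the grid
        col = int((x - x_range[0]) // cell)
        row = int((y - y_range[0]) // cell)
        cells[(row, col)].append((z, intensity))

    features = {}
    for key, samples in cells.items():
        heights = [z for z, _ in samples]
        intensities = [i for _, i in samples]
        features[key] = {
            "max_height": max(heights),
            "mean_intensity": sum(intensities) / len(intensities),
            "density": len(samples),
            "occupancy": 1,  # only non-empty cells are stored
        }
    return features
```

Each key of the returned dictionary corresponds to one pixel of the BEV image, and each feature to one channel; the vertical coordinate survives only through the per-cell statistics, which is the information loss mentioned above.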
Volumetric convolutional methods first voxelize the point cloud, representing it as a regularly spaced 3D voxel grid. Features such as point intensity, height, density and occupancy are extracted manually from the points within each voxel cell. Then 3D convolutions are adopted to process these voxels [15, 8, 35]. These methods encode the spatial information of the point cloud explicitly and lose less information than projection-based methods, thus producing satisfactory detection accuracy. However, due to the high computational cost of 3D convolution and the many empty voxels caused by point cloud sparsity, volumetric convolutional methods are time-consuming and inefficient.
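As a rough illustration of voxelization, the sketch below bins (x, y, z, intensity) points into a sparse dictionary of voxels and computes a few hand-crafted features per voxel. The names `voxelize` and `voxel_features` and the voxel size are assumptions made for this example only.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size=0.5):
    """Group (x, y, z, intensity) points by the integer index of their voxel."""
    voxels = defaultdict(list)
    for x, y, z, intensity in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        voxels[key].append((z, intensity))
    return voxels


def voxel_features(voxels):
    """Compute simple per-voxel features from the points inside each voxel."""
    features = {}
    for key, samples in voxels.items():
        heights = [z for z, _ in samples]
        intensities = [i for _, i in samples]
        features[key] = {
            "density": len(samples),
            "mean_height": sum(heights) / len(heights),
            "max_intensity": max(intensities),
            "occupancy": 1,
        }
    return features
```

Storing only non-empty voxels sidesteps the sparsity problem in this toy example; a dense 3D convolution, by contrast, has to visit every cell of the grid, empty or not.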
Projection-based methods and volumetric convolutional methods convert point-clouds into 2D images or 3D voxel grids. In contrast, raw point-cloud-based methods handle point-clouds directly to minimize spatial information loss. Most of the raw point-cloud-based methods are variants of PointNet, widely used for object classification and semantic segmentation. PointNet++ is the advanced version of PointNet and can extract local features efficiently, whilst Frustum PointNet constructs subsets of the point-cloud based on 2D detections on the image plane. These subsets are then fed directly into a PointNet for classification and prediction. PointNet-based methods assume segmented objects and are mainly used for simple indoor scenes in robotics, so they are not widely used in autonomous driving.
In conclusion, compared to the other two subcategories, projection-based methods are well-researched in the context of driving scenarios due to their similarity to mature 2D object detection methods. Even though they rely on hand-crafted feature extraction, they offer a good trade-off between time complexity and detection performance.
III FCNN-based Object Detection
Apollo is an open autonomous driving platform, which has released all the most important modules for the implementation of autonomous driving. For the perception task, Apollo uses a Fully Convolutional Neural Network (FCNN) to segment the point-clouds provided by LiDAR sensors. Apollo trained and tested its FCNN-based object detection model on its own large-scale dataset, ensuring a high level of robustness and accuracy.
The Apollo FCNN-based object detection model is composed of 3 steps:
Channel feature extraction: the 3D point-cloud provided by the LiDAR is projected onto a BEV image with pre-determined width, height and grid cell size. To avoid loss of information during the projection of the 3D point-cloud onto a 2D image, the 6 additional channels shown in Fig. 2 are stacked together with the new pattern to recover information about the peak and the mean values of height, intensity and distribution of the collapsed points for each cell. Moreover, binary information concerning the effective occupancy of each grid cell is included.
FCNN-based obstacle prediction: the 6 extracted feature channels, together with the BEV grid image, are provided to the Apollo FCNN detection model presented in Fig. 3. Assuming a unitary batch size, i.e. a single point-cloud frame, a 6x672x672 BEV image is used as input for the estimation process. The Apollo FCNN-based model is composed of convolutional layers (the encoder), deconvolutional layers (the decoder) and prediction layers (the predictor). The encoder consists of 15 consecutively stacked convolutional layers that downsample the spatial resolution of the input BEV image and extract increasingly complex features as the layers get deeper. Then, 10 tightly connected deconvolutional layers upsample the encoded BEV image back to the spatial resolution of the input BEV image. Skip connections are used to help recover the spatial details of the BEV image during deconvolution and to improve the training of the deep neural network. The algorithm outputs a 672x672 BEV image with 6 attributes for each grid cell, related to its effective occupancy and, if occupied, to its position, confidence prediction, class, absolute angle and height.
Obstacle clustering and post-processing: clustering is performed on the output BEV image, accounting for the 6 attributes of each occupied grid cell and based on a union-find algorithm. Then, during post-processing, a confidence value is assigned to each candidate clustered object in order to obtain a more accurate final output.
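The clustering and post-processing step can be sketched with a generic union-find (disjoint-set) structure. This is a simplified illustration, not Apollo's implementation: it assumes that cells are merged purely by 4-neighbour adjacency and that the cluster confidence is the mean of the per-cell confidences; the names `DisjointSet`, `cluster_cells` and `min_confidence` are hypothetical.

```python
class DisjointSet:
    """Minimal union-find structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[item] != root:  # compress the path to the root
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_cells(cell_confidence, min_confidence=0.5):
    """Merge adjacent occupied cells and keep clusters with enough confidence.

    cell_confidence maps (row, col) to the confidence predicted for that cell.
    Returns a list of (sorted member cells, mean confidence) tuples.
    """
    sets = DisjointSet()
    for row, col in cell_confidence:
        sets.find((row, col))
        # Checking the lower and right neighbours covers every adjacent pair once.
        for neighbour in ((row + 1, col), (row, col + 1)):
            if neighbour in cell_confidence:
                sets.union((row, col), neighbour)

    groups = {}
    for cell in cell_confidence:
        groups.setdefault(sets.find(cell), []).append(cell)

    clusters = []
    for members in groups.values():
        score = sum(cell_confidence[m] for m in members) / len(members)
        if score >= min_confidence:  # post-processing: drop weak candidates
            clusters.append((sorted(members), score))
    return clusters
```

The union-find structure keeps the merge step close to linear in the number of occupied cells, which matters when the grid has 672x672 cells.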
To conclude, for each individual detected object (stationary and moving), the estimation process provides the following information:
detection confidence score;
position w.r.t. the LiDAR coordinate system;
main dimensions, i.e. length, width and height;
classification, e.g. car, truck, cyclist or pedestrian;
main distance, i.e. distance between the object’s centroid and the LiDAR origin.
ROS is used to implement and test the Apollo FCNN-based 16-beam LiDAR detection model. Running on an NVIDIA GTX 1050Ti, the algorithm provides estimates at 10 Hz on the KITTI dataset.
IV Geometric-based Object Detection
As stated in the introduction, the second approach does not rely on deep learning but only on geometric and morphological transformations to retrieve the obstacles from the 3D point-cloud. The pipeline in Fig. 4 shows the operations required to convert a set of 3D points into a list of obstacles on the horizontal plane. The pipeline handles the conversion from a 3D point-cloud to a 2D occupancy grid, followed by the final tasks of clustering, identification and tracking.
The first step of this pipeline is to remove all the detected points belonging to the ground plane, i.e. the road surface, in order to reduce the incidence of false positives in the estimation process. To perform this task, an approach similar to the one presented in  is implemented, in which the plane fitting problem is based on RANSAC (RANdom SAmple Consensus).
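A basic RANSAC plane fit for ground removal can be sketched as follows. This is a generic textbook version under stated assumptions, not the authors' implementation: points are (x, y, z) tuples, and the names `fit_plane` and `ransac_ground`, the distance threshold, the number of iterations and the fixed seed are all illustrative.

```python
import random


def fit_plane(p1, p2, p3):
    """Return (a, b, c, d) of the plane through three points, with unit normal.

    Returns None when the three points are (nearly) collinear.
    """
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    a, b, c = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-9:
        return None
    a, b, c = a / norm, b / norm, c / norm
    return a, b, c, -(a * p1[0] + b * p1[1] + c * p1[2])


def ransac_ground(points, threshold=0.1, iterations=100, seed=0):
    """Remove the points lying on the dominant plane; return the remaining ones."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = [
            index
            for index, (x, y, z) in enumerate(points)
            if abs(a * x + b * y + c * z + d) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    ground = set(best_inliers)
    return [p for index, p in enumerate(points) if index not in ground]
```

The sketch treats the plane with the most inliers as the road surface, which only holds when the ground is the dominant planar structure in the scan.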
Once the ground plane is removed, the remaining points most likely belong to obstacles. Thus, they are projected onto a plane parallel to the road surface to obtain a set of 2D points. Discretization is then carried out by applying a grid to this horizontal plane. In particular, the grid is divided into square cells with side equal to : if the number of points in a cell is higher than a pre-computed threshold, the cell is set to occupied. Then, a further threshold parameter is applied to filter out noise effects that may lead to false positives. Both parameters have been tuned based on experimental measurements in a controlled environment, with values that decrease with the radial distance from the sensor to take into account the variable density of the point-cloud.
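The discretization with a distance-dependent threshold can be sketched as below. Only the first (point-count) threshold is modelled; the linear decrease, the numeric values and the names `min_points_for` and `occupancy_grid` are assumptions made for illustration, since the tuned parameters are not reported here.

```python
import math
from collections import Counter


def min_points_for(distance, near_threshold=6.0, far_threshold=2.0, max_range=30.0):
    """Point-count threshold that decreases linearly with the radial distance."""
    ratio = min(max(distance / max_range, 0.0), 1.0)
    return near_threshold + (far_threshold - near_threshold) * ratio


def occupancy_grid(points_xy, cell=0.5):
    """Return the set of (ix, iy) cells whose point count exceeds the threshold."""
    counts = Counter(
        (math.floor(x / cell), math.floor(y / cell)) for x, y in points_xy
    )
    occupied = set()
    for (ix, iy), count in counts.items():
        # Radial distance of the cell centre from the sensor at the origin.
        centre_x, centre_y = (ix + 0.5) * cell, (iy + 0.5) * cell
        if count > min_points_for(math.hypot(centre_x, centre_y)):
            occupied.add((ix, iy))
    return occupied
```

With these illustrative values, three returns in a cell close to the sensor are treated as noise, while the same three returns near the edge of the range mark the cell as occupied.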
The output of the previous phase is a simplified representation of the area surrounding the ego-vehicle, which is used to obtain the relative position of each object close to the vehicle. The occupancy grid provides information regarding the presence of objects for each cell, hence clustering is required to merge elements in the occupancy grid. As a preliminary step, a set of morphological operations (i.e., opening followed by closing) is required to connect areas that might belong to the same object but are not adjacent. This might happen due to obstructions or to the particular shape of the object itself, which cause the number of points belonging to a specific cell to fall below the filtering threshold explained before. The result is still an occupancy grid, in which all elements belonging to an obstacle are connected.
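The two morphological operations can be sketched on a sparse occupancy grid stored as a set of (row, col) cells. The 3x3 structuring element and the function names are assumptions for illustration; the structuring element used in the actual pipeline is not stated.

```python
# 3x3 structuring element, expressed as row/column offsets (assumed shape).
NEIGHBOURHOOD = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def dilate(cells):
    """Mark every cell that touches an occupied cell."""
    return {(r + dr, c + dc) for r, c in cells for dr, dc in NEIGHBOURHOOD}


def erode(cells):
    """Keep only the cells whose whole 3x3 neighbourhood is occupied."""
    return {
        (r, c)
        for r, c in cells
        if all((r + dr, c + dc) in cells for dr, dc in NEIGHBOURHOOD)
    }


def opening(cells):
    """Erosion followed by dilation: removes small isolated blobs."""
    return dilate(erode(cells))


def closing(cells):
    """Dilation followed by erosion: bridges narrow gaps between blobs."""
    return erode(dilate(cells))


def clean_grid(cells):
    """Opening followed by closing, in the order described in the text."""
    return closing(opening(cells))
```

In this sketch the opening discards isolated cells smaller than the structuring element, and the closing then fills one-cell gaps, so that two fragments of the same obstacle end up in a single connected region.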
Clustering is based on the OpenCV implementation of SAUF (Scan plus Array-based Union-Find); the output is a list of all the connected components in the occupancy grid which belong to real obstacles, each defined by the position of its centre of symmetry (CoS) relative to the ego-vehicle and by its equivalent dimensions. Length and width of each identified object are provided, but the accuracy of those data is limited by the nature of the input: an obstacle directly in front of the sensor is visible only from the back, so its length cannot be computed. Thanks to numerous optimizations and to the filtering process performed at the beginning of the pipeline, this solution is able to run smoothly at (i.e., the maximum frequency of the LiDAR sensor) on a consumer laptop.
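The sketch below shows how a list of obstacles with centre and dimensions can be derived from an occupancy grid. It deliberately swaps in a plain flood fill with 8-connectivity instead of the OpenCV SAUF routine used in the pipeline, and it assumes that rows run along the longitudinal axis, columns along the lateral axis, and that the centre is the centre of the axis-aligned bounding box; `connected_components`, `describe` and the cell size are hypothetical.

```python
def connected_components(occupied):
    """Group occupied (row, col) cells into 8-connected components (flood fill)."""
    remaining = set(occupied)
    components = []
    while remaining:
        stack = [remaining.pop()]
        component = []
        while stack:
            row, col = stack.pop()
            component.append((row, col))
            for d_row in (-1, 0, 1):
                for d_col in (-1, 0, 1):
                    neighbour = (row + d_row, col + d_col)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        stack.append(neighbour)
        components.append(component)
    return components


def describe(component, cell=0.5):
    """Bounding-box centre and dimensions of a component, in metres."""
    rows = [r for r, _ in component]
    cols = [c for _, c in component]
    return {
        "centre": (
            (min(rows) + max(rows) + 1) * cell / 2,
            (min(cols) + max(cols) + 1) * cell / 2,
        ),
        "length": (max(rows) - min(rows) + 1) * cell,
        "width": (max(cols) - min(cols) + 1) * cell,
    }
```

Because the dimensions come from the occupied cells only, an obstacle seen from behind yields a short `length`, which mirrors the limitation noted above.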
V System setup
The developed prototype of autonomous vehicle, shown in Fig. 1, is a front-wheel-drive full-electric plug-in quadricycle powered by a 6 kW motor. Starting from a commercial vehicle, the modifications made to the most important actuation systems (i.e. the steering, throttle and braking systems, as presented in [30, 29]) made it possible to obtain a fully automated vehicle. Therefore, in the current configuration the vehicle takes references from a control unit on which a trajectory planner like the one presented in  runs in real-time. Moreover, the control loop is closed by the estimation algorithm presented in , which provides vehicle pose and speed estimations at running on a consumer laptop.
A second vehicle is required to test and compare the presented algorithms. This vehicle is the target of the estimation process, and the ground truth is given by a GPS receiver mounted on its roof, which measures its absolute position with RTK correction service. Since the LiDAR sensor is mounted at the ego-vehicle center of gravity, a coordinate transformation involving the current heading angle of the ego-vehicle is required to express the ground truth in the same local reference system as the LiDAR, i.e. the one in which the estimates of the obstacle vehicle are provided. The transformation of coordinates is presented in eq. (1), where the output vector measures the current longitudinal and lateral distance ( and respectively) between the vehicle and the obstacle in the local reference frame, i.e. the right-handed moving reference system of the vehicle. This can be obtained by rotating the vector , which contains the distance between vehicle and obstacle in the absolute reference frame, obtained by converting the GPS measurements from degrees to meters in UTM coordinates.
To conclude, the rotation matrix performs a clockwise rotation by the current angle between the absolute and the local reference frame, i.e. the heading angle of the vehicle ().
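The change of reference frame can be sketched as a planar rotation of the relative position vector. The sketch assumes that both positions are already expressed in metres as (east, north) UTM coordinates and that the heading is measured counter-clockwise from the east axis, in radians; these conventions and the name `global_to_local` are assumptions, not taken from eq. (1).

```python
import math


def global_to_local(ego_en, target_en, heading):
    """Rotate the ego-to-target vector from the absolute frame into the vehicle frame.

    ego_en, target_en: (east, north) positions in metres.
    heading: ego heading in radians, counter-clockwise from east (assumed convention).
    Returns (longitudinal, lateral) distances, with lateral positive to the left.
    """
    d_east = target_en[0] - ego_en[0]
    d_north = target_en[1] - ego_en[1]
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    # Clockwise rotation by the heading angle.
    longitudinal = cos_h * d_east + sin_h * d_north
    lateral = -sin_h * d_east + cos_h * d_north
    return longitudinal, lateral
```

For example, with the ego-vehicle heading north, a target 10 m further north is 10 m ahead with no lateral offset, while a target 5 m to the west appears 5 m to the left.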
VI Experimental Results
To quantitatively compare the two algorithms, we recorded a dataset at the Monza ENI Circuit using the setup described in the previous section. In particular, the dataset refers to the part of the track between the Serraglio and the Parabolica. This makes it possible to evaluate the performance in a mixed section of the circuit, i.e. where the relative direction between the vehicle and the target obstacle changes frequently. Moreover, the dataset includes two overtaking manoeuvres, in which the two vehicles swap their relative positions. The algorithms’ performance is evaluated in terms of estimated distance in the longitudinal and lateral directions in the vehicle reference frame, and also in terms of the predicted size of the target vehicle, a commercial van whose width and length are respectively equal to and meters. A visual representation of the system’s output is given in Fig. 5, where the bounding box created by the Apollo FCNN is superimposed on the target vehicle, which is also captured on camera as shown in the top-left part of the image.
To compare the results, we ran both algorithms on the recorded data and compared the computed positions against the ground truth. A visual comparison of the estimated distances between vehicle and obstacle (i.e., in longitudinal and in lateral direction) is shown in Figure 6. It is possible to notice how the measurement direction affects the algorithm performance. Although both algorithms qualitatively follow the ground truth correctly, the geometrical one underestimates the distance with an offset with average value; on the other hand, the estimation performed with the FCNN shows a positive offset with average with a smaller standard deviation compared to the previous one. As for the different signs of the offsets, the geometrical method provides the distance with respect to the centroid of the 3D points measured by the sensor; hence the fact that the obstacle is higher than the vehicle affects the accuracy of the method, since it cannot account for the real shape of the target. On the other hand, the FCNN used by the other algorithm has been trained with point-clouds obtained from L-shaped vehicles, which also explains why estimates are not available when the ground truth of the lateral distance is close to zero, i.e. when the target and the vehicle are driving in the same direction (as shown in the range ). Thus, the positive value of is due to the fact that the GPS receiver on the target is mounted ahead of the back, hence is located behind the real centroid of the vehicle: this explains why the FCNN gives a higher value of .
Regarding the estimated distance in the lateral direction, the results at the bottom of Fig. 6 show in general a larger relative error, i.e. the value of the offset compared to the nominal value given by the ground truth. Moreover, as anticipated, the FCNN-based method is not able to provide any obstacle detection in vehicle-following scenarios. Furthermore, as for the estimates of , the fact that the GPS receiver is located meter beyond the lateral edge of the obstacle generates an offset whose sign depends directly on the direction of the lateral displacement between the two vehicles. The geometric solution seems to perform better at long distance (i.e., above ) and when the obstacle is directly in front of the vehicle, where the number of points representing the van is considerably low and the neural network is not able to extract enough features to detect it. Conversely, the deep learning based approach performs considerably better on the lateral position at close distance, where the number of points is high and the network can accurately reconstruct the bounding box.
To conclude, the estimated width and length of the vehicle for each algorithm are shown in Table I. Those values are derived together with the estimates shown in Fig. 6. For this comparison as well, the results are similar, taking into account that the lower standard deviations obtained by the FCNN-based algorithm can be ascribed to the fact that it does not provide any prediction in scenarios far from its optimal working conditions.
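The offsets and standard deviations discussed in this section can be computed with a few lines of code. The sketch below is an assumption about how such statistics may be obtained, not the evaluation script used for the experiments: it takes time-aligned lists of estimates and ground-truth values, uses `None` for frames in which an algorithm gives no detection, and also reports the fraction of frames with a valid estimate. The name `offset_stats` is hypothetical.

```python
from statistics import mean, pstdev


def offset_stats(estimates, ground_truth):
    """Mean and standard deviation of the estimation error, plus coverage.

    estimates: per-frame estimates, None where no detection is available.
    ground_truth: per-frame reference values, aligned with estimates.
    Returns None when no frame has a valid estimate.
    """
    errors = [e - g for e, g in zip(estimates, ground_truth) if e is not None]
    if not errors:
        return None
    return {
        "mean": mean(errors),        # positive: distance is overestimated
        "std": pstdev(errors),       # spread of the error around its mean
        "coverage": len(errors) / len(ground_truth),
    }
```

Reporting the coverage next to the standard deviation makes the caveat above explicit: an algorithm that stays silent in difficult frames can show a smaller spread simply because those frames are excluded.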
This paper presents a comparison between two algorithms for obstacle state estimation in autonomous driving. The analyzed algorithms take as input the 3D point-clouds given by a rotating LiDAR sensor, and estimate the lateral and longitudinal components of the distance between the vehicle on which the sensor is mounted and the target vehicle. Moreover, both algorithms output an estimate of the main obstacle dimensions for each detection.
In this paper, 3D point-clouds are pre-processed by projecting them onto a 2D plane, which is first discretized and then either processed using geometrical and morphological analysis, in the first case, or fed to an FCNN, in the other.
Results show that the estimates provided by the FCNN-based algorithm are in general less affected by noise, but this algorithm does not work properly when the two vehicles are one in front of the other and moving in the same direction, i.e. when the LiDAR can detect only one edge of the target vehicle. This may be due to the dataset used to train the network. However, this method is more accurate when estimating the main dimensions of the vehicle. On the other hand, the algorithm that performs geometrical and morphological analysis on the pre-processed point-cloud is more flexible, since it provides estimates in a wider range of working conditions.
- (2019) 3D obstacle perception. Online.
- (2019) A survey on 3D object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems 20 (10), pp. 3782–3795.
- Non-linear MPC motion planner for autonomous vehicles based on accelerated particle swarm optimization algorithm. In 2019 International Conference of Electrical and Electronic Technologies for Automotive.
- (2018) BirdNet: a 3D object detection framework from LiDAR information. arXiv preprint arXiv:1805.01195.
- Vehicle state estimation based on Kalman filters. In 2019 AEIT International Conference of Electrical and Electronic Technologies for Automotive (AEIT AUTOMOTIVE), pp. 1–6.
- (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
- (2017) Multi-view 3D object detection network for autonomous driving.
- (2017) Vote3Deep: fast object detection in 3D point clouds using efficient convolutional neural networks. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 1355–1361.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916.
- (2015) Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Vehicle detection from 3D LiDAR using fully convolutional network. Robotics: Science and Systems.
- (2017) 3D fully convolutional network for vehicle detection in point cloud. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1513–1518.
- (2019) Comparison of 3D object detection based on LiDAR point cloud. IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS), pp. 678–684.
- (2016) SSD: single shot multibox detector. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Multi-layer occupancy grid mapping for autonomous vehicles navigation. In 2019 AEIT International Conference of Electrical and Electronic Technologies for Automotive (AEIT AUTOMOTIVE), pp. 1–6.
- (2018) LMNet: real-time multiclass object detection on CPU using 3D LiDARs. arXiv preprint arXiv:1805.04902.
- (2020) Monza ENI circuit. Note: https://www.monzanet.it/
- (2018) Frustum PointNets for 3D object detection from RGB-D data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) PointNet: deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems.
- (2009) ROS: an open-source robot operating system. In ICRA workshop on open source software, Vol. 3, pp. 5.
- (2016) You only look once: unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149.
- (2018) Complex-YOLO: real-time 3D object detection on point clouds. arXiv preprint arXiv:1803.06199.
- (2018-07) Autonomous steer actuation for an urban quadricycle. In 2018 International Conference of Electrical and Electronic Technologies for Automotive.
- (2018-07) On how to transform a commercial electric quadricycle into a full autonomously actuated vehicle. In International Symposium on Advanced Vehicle Control.
- (2009) Optimizing two-pass connected-component labeling algorithms. Pattern Analysis and Applications 12 (2), pp. 117–135.
- (2018) PIXOR: real-time 3D object detection from point clouds. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019-08) A multi-sensor fusion and object tracking algorithm for self-driving vehicles. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering 233, pp. 2293–2300.
- (2017) Fast segmentation of 3D point clouds: a paradigm on LiDAR data for autonomous vehicle applications. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5067–5073.
- (2017) VoxelNet: end-to-end learning for point cloud based 3D object detection. arXiv preprint arXiv:1711.06396.