1 Introduction
The success of autonomous vehicles in urban scenes heavily relies on the ability to handle complex environments, where accurate and robust perception is the foundation. To achieve this, autonomous vehicles are equipped with various sensors, including cameras, radar and lidar, among which lidar is considered the most critical one. The lidar sensor provides accurate depth information, a significant advantage over images, and thus lidar-based object detection [36, 13, 40, 37] achieves much better performance than image-based methods [4, 32, 14, 19]. Mainstream 3D detection frameworks often focus on single-category detection, such as car or pedestrian, while in the real world autonomous vehicles need to detect multi-class objects simultaneously. Hence, how to distinguish heterogeneous categories plays an indispensable role in the success of multi-class 3D object detection.
A natural idea to handle this challenge is to utilize differences in appearance or texture to distinguish objects. Unfortunately, this approach is not feasible for point clouds, since their point-based representation lacks texture and appearance. An appealing alternative is to explore shape information to guide discriminative feature learning. Fig. 1 shows an example that demonstrates the shape difference between two categories. From the teaser, we can see that shape and scale vary with the category. However, due to the sparsity and noise of point clouds, how to build an effective and robust shape encoding remains a widely open question.
In this paper, we propose a novel shape signature for shape encoding, which possesses two appealing properties, i.e. compactness (effective and short, as required for an objective) and robustness (against sparsity and noise). Specifically, as a lidar scan often covers only part of an object (e.g. two or three faces), we first use a symmetry operation to complete the sparse points. Then we project the points to three views of the object, including the bird view, side view and front view, to thoroughly model the shape information. Furthermore, the convex hull is introduced to represent the shape of the three views, making the encoding robust to inner-sparsity. Based on the convex hull, we use an angle-radius strategy to form the function of the convex hull, in which each sampled angle corresponds to a radius from the inner center to the contour. Finally, to make the shape encoding more compact and robust, we apply Chebyshev fitting to approximate the angle-radius function, and the final shape encoding is formed by the coefficients of the Chebyshev approximation. Note that the proposed shape signature aims to keep the shape information consistent (not identical) within the same category and to separate the shape distributions across different categories, which enables the shape signature to serve as a soft constraint for learning discriminative and class-specific features.
Based on the proposed shape signature, we develop the shape signature networks (SSN) for multi-class 3D object detection. The basic idea is to incorporate the shape information to better distinguish multiple categories. Specifically, SSN consists of four components: the point-to-structure part, the pyramid feature encoding part, the shape-aware grouping heads and the shape signature objective. Here, the shape-aware grouping heads bring objects with similar shape together, so that weights are shared based on object size (e.g. bus and truck need a heavier head than car), while the shape signature acts as an auxiliary objective, thus benefiting the feature capability of multi-category discrimination.
We test the proposed framework on two large-scale datasets, the nuScenes [3] and Lyft [1] datasets, which contain multi-class objects, including car, bus, pedestrian, motorcycle, etc. In these experiments, SSN yields considerable improvement over existing methods, about 10% in NDS and 5% in mAP3D, respectively. We also make an in-depth investigation of the proposed shape signature, showing its good scalability with different backbone networks on different datasets. t-SNE visualization of the shape signature vector also verifies its role as a soft constraint.
The contributions of this work mainly lie in four aspects: (1) We propose a novel shape signature to explicitly explore the 3D shape information of point clouds, which is compact yet contains sufficient information and is robust against noise and sparsity. (2) We develop the shape signature networks (SSN) for object detection from point clouds, which effectively perform multi-class detection through shape-aware head grouping and the shape signature objective. (3) We conduct extensive experiments to compare the proposed method with others on various benchmarks, where it consistently yields notable performance gains. (4) The proposed 3D shape signature can act as a plug-and-play component and is independent of the backbone. Experiments with different backbone networks show its good scalability.
2 Related Work
Shape Representation. Numerous works have been devoted to this research area in recent decades. Johnson et al. [10] introduced a local shape descriptor on 3D point clouds called spin images. Based on spin images, Golovinskiy et al. [8] incorporated contextual features into the shape descriptor. While these local descriptors build the encoding from a local neighborhood, global descriptors [2, 6, 11, 21] encode the geometric and structural information of the whole 3D point cloud. IS [9] introduced an implicit shape signature for instance segmentation by using an autoencoder to learn a low-dimensional shape embedding space. The Viewpoint Feature Histogram (VFH) [27] used a viewpoint direction component and a surface shape component to bin the point cloud for shape encoding. However, most of them pursue neither a compact representation nor robustness to sparsity, which is the major difference between our shape signature and theirs. The proposed shape signature performs symmetry for completion, uses the convex hull against inner-sparsity, and applies Chebyshev fitting to obtain a short vector. The cooperation of these operations leads to a compact and robust shape encoding.
3D Object Detection
Most 3D object detection methods can be divided into two groups: image-based methods and lidar-based methods. For image-based methods, the key insight is to estimate reliable depth information to replace lidar [32, 14, 19]. Monocular or stereo depth estimation methods [35] have greatly pushed forward the state of the art in this field. [33] introduced a multi-level fusion method by concatenating the image and a generated depth map. [22] incorporated depth features, including the disparity map and the distance to the ground, into the detection framework. However, although image-based methods have made significant progress, their performance still lags far behind lidar-based methods. Lidar-based methods are the mainstream of the 3D detection task, as lidar provides accurate 3D information. Most lidar-based methods process the unstructured point input in different representations. In [40, 36, 30], point clouds were converted into voxels and an SSD-based [20] convolutional network was used for detection. PointPillars [13] used pillars to encode the point cloud with PointNet [26]. [5, 38, 16, 12, 34, 15] converted point cloud data into a BEV representation and then fed it into a structured convolutional network. [28, 39, 24] introduced two-stage detectors into 3D detection, where coarse proposals are first generated and a refinement stage then produces the final predictions. [25] used the raw point cloud as input and extracted a frustum region reasoned from 2D object detection to localize 3D objects. However, most of them focus on single-class detection while neglecting to explore multi-class discrimination. Compared to these works, our proposed method differs essentially in that it effectively explores shape information, which plays a crucial role in distinguishing multi-class objects.
3 Methodology
3.1 Overview
Given a point cloud, our goal is to localize and classify the multi-class target objects. Unlike single-class detectors, we aim to obtain a detector that can effectively distinguish objects from multiple categories. To this end, we propose a multi-class 3D detection framework based on shape information exploration. The basic idea is to utilize the shape information via two key ingredients, i.e. the shape signature objective and the shape-aware grouping heads, to benefit the multi-class classification.
As shown in Fig. 3, our framework consists of four components, i.e. point-to-structure, pyramid feature encoding, shape-aware grouping heads and multi-task objectives, where point-to-structure and pyramid feature encoding are flexible (i.e. multiple options are available). The key components of SSN are the shape signature objective and the shape-aware grouping heads. Particularly, during training the shape signature objective guides the learning of discriminative features via back-propagation, benefiting multi-class discrimination. After training, the shape signature objective is no longer needed. In what follows, we present the details of the shape signature and SSN.
3.2 Shape Signature
Given the ground-truth points of an object, we parameterize its shape information with the proposed shape signature, then apply the obtained shape signature vector as a soft constraint to improve the feature capability of multi-class discrimination. As mentioned above, the desired shape signature should carry two properties: 1) compact and effective as a part of the objective; 2) robust to sparsity and noise. To achieve this, we introduce several operations to handle the issues of point clouds. As shown in Fig. 2, the shape signature contains two components, shape completion and shape embedding, where shape completion consists of Transform and Symmetry, and shape embedding involves Projection, Convex Hull, Angle-Radius and Chebyshev Fitting.
3.2.1 Shape Completion
Since a lidar scan only provides a partial observation, the investigation of shape is limited. We thus introduce shape completion to tackle this issue, which consists of the following steps.
Transform. The points of a target object are located in the scene coordinate frame. We first translate the center of the ground-truth box to the origin, and use the forward (heading) direction as the reference axis.
Symmetry. Lidar scans only cover two or three faces of an object, and this partial observation affects the investigation of shape. We introduce centrosymmetry to complete the partial view. From Fig. 2 (b), we can see that after symmetry, the points of the target object become denser and the observation becomes complete.
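As an illustration, the following sketch implements the two completion steps, assuming the (N, 3) points of one object and the center and yaw of its ground-truth box; the names are illustrative, not the authors' API.

```python
# A minimal sketch of shape completion (Transform + Symmetry) for one object.
import numpy as np

def canonicalize(points, center, yaw):
    """Move object points into a box-centric frame aligned with the heading axis."""
    shifted = points - center                       # translate box center to the origin
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return shifted @ rot.T                          # rotate so the heading aligns with +x

def centro_symmetric_completion(points):
    """Complete the partial lidar view by reflecting points through the origin."""
    return np.concatenate([points, -points], axis=0)
```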
3.2.2 Shape Embedding
We then introduce the following operations to achieve a compact and effective shape embedding.
Projection. Given the completed points, we project the 3D points onto three 2D views, i.e. the bird view, front view and side view. Through this projection, the 3D points are decoupled into several 2D planes, which thoroughly describe the 3D shape and reduce the number of parameters.
Convex Hull. After projection, we obtain 2D points for each view. However, this raw organization of 2D points is limited for effectively representing the shape, and inner-sparsity still exists. Hence, the convex hull is introduced to characterize the 2D points and emphasize the contour of each view, making the encoding robust to inner-sparsity. Furthermore, the contour of the 2D points also maintains the scale information, which is an important factor for multi-class discrimination (see Fig. 2 (d)).
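A minimal sketch of the projection and convex-hull steps (Python, using scipy); the axis pairing of the three views is one plausible convention, and the names are illustrative rather than the authors' code.

```python
# Project the completed, box-centric (N, 3) points to three 2D views and keep
# only the convex-hull contour of each view.
import numpy as np
from scipy.spatial import ConvexHull

VIEWS = {"bird": (0, 1),    # x-y plane (bird view)
         "front": (1, 2),   # y-z plane (front view)
         "side": (0, 2)}    # x-z plane (side view)

def view_hulls(pts):
    """Return, for each view, the hull vertices (counter-clockwise contour)."""
    hulls = {}
    for name, (i, j) in VIEWS.items():
        pts2d = pts[:, [i, j]]
        hull = ConvexHull(pts2d)          # degenerate (e.g. collinear) views not handled here
        hulls[name] = pts2d[hull.vertices]
    return hulls
```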
Angle-Radius. To describe the convex hull and highlight the contour shape and scale, we design an angle-radius parametric function $r = f(\theta)$. We use the center of the ground-truth box as the origin and densely sample angles $\theta$. For each angle, the ray from the origin at angle $\theta$ intersects the convex hull, and $f(\theta)$ gives the distance from the origin to the intersection point (i.e. the radius). In the implementation, we sample 360 angles and calculate the radius for each of them.
From Fig. 2 (e) (see the aspect ratios), the angle-radius function does well in maintaining the shape and scale of the contour. However, the dense sampling also yields a long vector (360 dimensions), which is not desirable for an objective. Hence, to shorten this long vector and further enhance the robustness against noise (e.g. a few outliers in the 2D points), we introduce Chebyshev fitting to process the angle-radius function $r = f(\theta)$.
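As a concrete illustration, the sketch below samples the angle-radius function for one view; it assumes the hull vertices from the previous snippet and that the box center (the origin after completion) lies inside the hull, which holds for a non-degenerate centro-symmetric point set.

```python
# Cast 360 rays from the origin and record where each ray hits the hull contour.
import numpy as np

def angle_radius(hull, num_angles=360):
    """hull: (M, 2) counter-clockwise contour vertices; returns (thetas, radii)."""
    thetas = np.linspace(0.0, 2.0 * np.pi, num_angles, endpoint=False)
    radii = np.zeros(num_angles)
    edges = list(zip(hull, np.roll(hull, -1, axis=0)))   # consecutive contour edges
    for k, theta in enumerate(thetas):
        d = np.array([np.cos(theta), np.sin(theta)])     # unit ray direction
        for p, q in edges:
            # solve t*d + s*(p - q) = p, i.e. t*d lies on the segment p->q
            A = np.column_stack([d, p - q])
            if abs(np.linalg.det(A)) < 1e-12:
                continue                                  # ray parallel to this edge
            t, s = np.linalg.solve(A, p)
            if t > 0 and 0.0 <= s <= 1.0:
                radii[k] = t                              # distance origin -> contour
                break
    return thetas, radii
```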
Chebyshev Fitting. Chebyshev polynomial fitting [23] provides an approximation that is close to the best polynomial approximation of a function under the maximum norm. Our goal is to apply Chebyshev polynomials to approximate the angle-radius function, and then use the resulting coefficients as the final shape vector.
There are two kinds of Chebyshev polynomials [23], and we use the Chebyshev polynomials of the first kind, which are defined by the recurrence relation:

$T_0(x) = 1, \quad T_1(x) = x$  (1)

$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)$  (2)

Hence, the Chebyshev approximation of a function $f$ can be written as a weighted sum of $T_n$:

$f(x) \approx \sum_{n=0}^{N-1} c_n T_n(x)$  (3)

where $c_n$ are the coefficients. These coefficients can be computed from the function values at the Chebyshev nodes $x_k = \cos\big(\pi(k+\tfrac{1}{2})/N\big)$, $k = 0, \dots, N-1$, with the formulas:

$c_0 = \frac{1}{N}\sum_{k=0}^{N-1} f(x_k)\,T_0(x_k)$  (4)

$c_n = \frac{2}{N}\sum_{k=0}^{N-1} f(x_k)\,T_n(x_k), \quad n \ge 1$  (5)

Since Eq. (3) involves $N$ coefficients, we truncate the series and keep only the top $k$ terms. For each view, these top $k$ coefficients form the view's shape vector, and the final shape signature concatenates the vectors of the three views. In the implementation, we use $k = 3$, so the final shape signature vector has dimension 9, which is suitable to serve as an objective for the network.
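The following is a minimal Python (numpy) sketch of this step, assuming the per-view angle-radius samples from the previous snippet; a least-squares Chebyshev fit stands in for the node-based formulas in Eqs. (4)-(5), and the function name is illustrative.

```python
# Fit a low-degree Chebyshev series to each view's angle-radius function and
# concatenate the leading coefficients into the 9-dimensional shape signature.
import numpy as np
from numpy.polynomial import chebyshev as C

def shape_signature(view_radii, k=3):
    """view_radii: dict view -> (thetas, radii); returns the concatenated signature."""
    coeffs = []
    for thetas, radii in view_radii.values():
        series = C.Chebyshev.fit(thetas, radii, deg=k - 1)   # degree 2 -> 3 coefficients
        coeffs.append(series.coef)                           # [c0, c1, c2] for this view
    return np.concatenate(coeffs)                            # 3 views x 3 = 9 dims
```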
Some Extreme Cases. Due to limitations of the lidar sensor and human annotation, some ground-truth boxes contain five or fewer points, or even zero points in the case of incorrect labeling. For these boxes, it is hard to model the shape information, so we use the average encoding of the corresponding category to represent their shape vectors.
3.3 SSN: Shape Signature Networks
Based on the proposed shape signature, we design SSN to achieve effective multi-class 3D detection. We first describe each component, especially the two key ingredients, i.e. the shape-aware grouping heads and the shape signature objective, then we integrate the different parts toward a unified target: exploring shape information to better distinguish multi-class objects.
Point-to-Structure. Since point clouds are unstructured, the first step is to transform the point cloud into a structured representation. As mentioned above, multiple options are available in this part, such as voxel-based [36, 40], pillar-based [13] or bird's-eye-view [5] representations. After obtaining the structured representation, subsequent 2D or 3D convolutional networks can be applied. In our implementation, we choose the pillar-based representation to structure the point clouds. Furthermore, we also test the shape signature with another structured representation (voxel-based), and the proposed shape signature shows good scalability.
Pyramid Feature Encoding. We follow the idea of FPN [17] to perform feature encoding. A top-down convolutional network is first applied to extract features at multiple spatial resolutions. Then all features are fused together through upsampling and concatenation.
Shape-aware Grouping Heads. Since multi-class target objects vary significantly in scale and shape, we propose shape-aware grouping heads to exploit this property for multi-class discrimination. The basic idea is to create multiple heads, in which objects with similar scale and shape share weights. The reasons are as follows: 1) objects with different scale and shape should have different heads; for example, the head for bus needs to be heavier (i.e. deeper) than the head for bike due to its large scale, because a heavier head provides a larger receptive field. 2) grouping heads by shape performs a coarse shape exploration and also alleviates interference from other groups.
As shown in Fig. 3, the design of the shape-aware grouping heads follows the spirit of "larger object, heavier head". Based on the shape and scale of the target objects, we group bus, truck and trailer together with a heavier head, gather bicycle and motorcycle with a lighter head, and treat the car with a medium head. Each head only covers the predictions of the corresponding categories. By integrating the above components, an SSD-based detection framework is formed.
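The sketch below illustrates the "larger object, heavier head" grouping in PyTorch; the channel sizes, layer layout and per-location prediction shapes are illustrative assumptions rather than the exact architecture (see the Supplementary Materials for the latter).

```python
# Schematic grouped detection heads: heavier tower for large objects.
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class GroupHead(nn.Module):
    def __init__(self, cin, num_classes, num_downsample):
        super().__init__()
        blocks = []
        for _ in range(num_downsample):
            # each downsample block (stride 2) is followed by a stride-1 block
            blocks += [conv_block(cin, cin, stride=2), conv_block(cin, cin, stride=1)]
        if not blocks:                         # lighter head: a single stride-1 block
            blocks = [conv_block(cin, cin, stride=1)]
        self.tower = nn.Sequential(*blocks)
        # per-location predictions (one anchor per cell for brevity):
        # class scores, 7 box residuals, 9-d shape vector
        self.cls = nn.Conv2d(cin, num_classes, 1)
        self.box = nn.Conv2d(cin, 7, 1)
        self.shape = nn.Conv2d(cin, 9, 1)

    def forward(self, x):
        x = self.tower(x)
        return self.cls(x), self.box(x), self.shape(x)

# heavier head for large objects, medium for car, lighter head for two-wheelers
heads = nn.ModuleDict({
    "bus_truck_trailer": GroupHead(cin=384, num_classes=3, num_downsample=2),
    "car":               GroupHead(cin=384, num_classes=1, num_downsample=1),
    "bike_motorcycle":   GroupHead(cin=384, num_classes=2, num_downsample=0),
})
```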
3.4 Multi-task Objectives
In our framework, there are three objectives, i.e. multi-class classification, localization regression and shape vector regression. For the multi-class classification, we follow the previous work [36] and use the focal loss [18]:

$\mathcal{L}_{cls} = -\alpha_t (1 - p_t)^{\gamma} \log p_t$  (6)

where $p_t$ is the class probability of the default box, and $\alpha$ and $\gamma$ are the focal-loss hyper-parameters, for which we follow the settings in [36]. For the localization loss, we use the smooth L1 loss to minimize the distance between predictions and localization residuals [36]:

$\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)$  (7)

where $\Delta b$ are the localization residuals, covering the center $(x, y, z)$, scale $(w, l, h)$ and rotation $(\theta)$.

Unlike the localization branch, which regresses residuals, the network is trained to directly regress the shape vector. For the shape regression, we also apply the smooth L1 loss:

$\mathcal{L}_{shape} = \mathrm{SmoothL1}(\hat{s} - s)$  (8)

where $s$ is the ground-truth shape signature vector and $\hat{s}$ is the predicted one.

The total objective of the three tasks is therefore:

$\mathcal{L}_{total} = \beta_{cls}\mathcal{L}_{cls} + \beta_{loc}\mathcal{L}_{loc} + \beta_{shape}\mathcal{L}_{shape}$  (9)

where $\beta_{cls}$, $\beta_{loc}$ and $\beta_{shape}$ are constant weighting factors for the loss terms. As the shape loss is much larger in value than the localization and classification losses, we set these weights to balance the value scales of the three terms.
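A hedged sketch of how the three objectives in Eqs. (6)-(9) could be combined in PyTorch; the focal-loss hyper-parameters and the loss weights shown are common defaults rather than the paper's exact values.

```python
# Minimal multi-task loss sketch: focal loss (Eq. 6), smooth-L1 box loss (Eq. 7),
# smooth-L1 shape loss (Eq. 8), combined as in Eq. (9).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # standard focal loss on per-anchor class logits with one-hot targets
    # (alpha, gamma are commonly used defaults, not the paper's stated values)
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

def multitask_loss(cls_logits, cls_targets, box_pred, box_targets,
                   shape_pred, shape_targets, pos_mask,
                   w_cls=1.0, w_loc=1.0, w_shape=1.0):
    loss_cls = focal_loss(cls_logits, cls_targets)                                   # Eq. (6)
    loss_loc = F.smooth_l1_loss(box_pred[pos_mask], box_targets[pos_mask])           # Eq. (7)
    loss_shape = F.smooth_l1_loss(shape_pred[pos_mask], shape_targets[pos_mask])     # Eq. (8)
    return w_cls * loss_cls + w_loc * loss_loc + w_shape * loss_shape                # Eq. (9)
```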
4 Experiments
4.1 Datasets
Two large-scale datasets, the nuScenes dataset and the Lyft dataset, are used in our experiments. The details of the two datasets are as follows.
NuScenes Dataset [3] It collects 1000 scenes of 20s duration with a 32-beam lidar sensor. The total number of frames is 40,000, sampled at 2Hz, and there are about 1.4 million 3D boxes in total. 10 categories are annotated for 3D detection, including Car, Pedestrian, Bus, Barrier, etc. (details in the experimental results). The data is officially split into training and validation sets, and the test results are evaluated on EvalAI (https://evalai.cloudcv.org/web/challenges/challengepage/356/overview). Furthermore, a new metric is introduced with the nuScenes dataset, the nuScenes detection score (NDS) [3], which quantifies the quality of detections in terms of average classification precision, box location, size, orientation, attributes, and velocity. The mean average precision (mAP) is based on distance thresholds (i.e. 0.5m, 1.0m, 2.0m and 4.0m). The whole range is about 100 meters, and we mainly use the range of 0-50m over the full 360 degrees. More detailed descriptions are given in the Supplementary Materials.
Lyft Dataset [1] It contains one 40-beam roof lidar and two 40-beam bumper lidars; in the experiments, we only use the data from the roof lidar. The data format is similar to that of the nuScenes dataset. In total, 9 categories are annotated for detection, including car, emergency_vehicle, motorcycle, bus, truck, etc. A total of 22,680 frames are used as training data, and the test set contains 27,468 frames, 30% of which is used for validation in the Kaggle competition (https://www.kaggle.com/c/3dobjectdetectionforautonomousvehicles). The evaluation metric is the mean average precision, which is similar to the metric of the COCO dataset but computes the 3D IoU (with thresholds of 0.5, 0.55, 0.6, 0.65, …, 0.95). Hence, we name it mAP3D; it is worth noting that mAP3D is much stricter than the mAP in nuScenes and Kitti [7].
4.2 Implementation Details
In our implementation, we use the pillar-based [13] method to convert the point cloud into the structured representation. For the nuScenes dataset, the x, y, z range is ([-49.6, 49.6], [-49.6, 49.6], [-5, 3]) and the pillar size is [0.2, 0.2, 8]. The maximum number of pillars is 30,000 and the maximum number of points per pillar is 20. For the Lyft dataset, the range is ([-89.6, 89.6], [-89.6, 89.6], [-5, 3]) and the pillar size is [0.2, 0.2, 8] as well. The maximum number of pillars is 60,000 and the maximum number of points per pillar is 12.
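For reference, the voxelization settings above can be written as configuration dictionaries in the style of common 3D-detection codebases (field names are illustrative):

```python
# Pillar / voxelization settings as config dicts (field names are illustrative).
NUSCENES_PILLARS = dict(
    point_cloud_range=[-49.6, -49.6, -5.0, 49.6, 49.6, 3.0],  # x, y, z min / max (m)
    pillar_size=[0.2, 0.2, 8.0],                              # x, y, z (m)
    max_pillars=30000, max_points_per_pillar=20)

LYFT_PILLARS = dict(
    point_cloud_range=[-89.6, -89.6, -5.0, 89.6, 89.6, 3.0],
    pillar_size=[0.2, 0.2, 8.0],
    max_pillars=60000, max_points_per_pillar=12)
```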
For the anchors, we calculate the mean width, length and height of each class and use the bird-view 2D IoU (width and length) as the matching metric; when the IoU between an anchor and a ground-truth box is larger than the positive threshold, the anchor is positive, and when it is smaller than the negative threshold, the anchor is negative. The matching thresholds differ across categories. During inference, multi-class rotated NMS is employed, where multi-class NMS means applying NMS to each class independently. For a fair comparison, no multi-scale training/testing, SyncBN or ensembling is applied. For the nuScenes dataset, online ground-truth sampling [36, 41] is not used. We also submit the results to the official websites (https://www.nuscenes.org/objectdetection and https://www.kaggle.com/c/3dobjectdetectionforautonomousvehicles); our submissions are "gezi" and "OIDH" respectively, both anonymous.
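A sketch of the multi-class NMS described above: NMS is applied to each class independently, given any rotated-NMS primitive, which is not implemented here and is passed in as an argument.

```python
# Per-class (multi-class) NMS wrapper around an arbitrary rotated-NMS primitive.
import torch

def multiclass_nms(boxes, scores, labels, nms_fn, iou_thr=0.5, score_thr=0.05):
    """nms_fn(boxes, scores, iou_thr) must return the kept indices for one class."""
    keep_all = []
    for c in labels.unique():
        m = (labels == c) & (scores > score_thr)
        idx = torch.where(m)[0]
        if idx.numel() == 0:
            continue
        keep = nms_fn(boxes[idx], scores[idx], iou_thr)   # class-wise suppression
        keep_all.append(idx[keep])
    if not keep_all:
        return torch.empty(0, dtype=torch.long, device=boxes.device)
    return torch.cat(keep_all)
```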
Network Details. For the point-to-structure part, we follow the network in [13], where a simplified PointNet is used. It contains a linear layer, a BatchNorm layer and a ReLU layer to process the pillar features. For the CNN feature encoding, an FPN-based module is introduced to extract the fused features: three levels of features are first upsampled with transposed 2D convolutions and then concatenated. For the shape-aware grouping heads, objects with similar shape and scale share the same head. For bus, truck and trailer, a heavier head is applied, where two downsample blocks process the features from the FPN. Each downsample block consists of a 3x3 2D convolution layer with stride=2, followed by BatchNorm and ReLU. For the lighter head (e.g. bicycle, motorcycle), a block with stride=1 is used, and for the medium head, one downsample block is applied. Note that each downsample block is followed by another block with stride=1. A more detailed network structure is shown in the Supplementary Materials.
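A minimal sketch of the simplified PointNet used for point-to-structure, following the description above (a linear layer, BatchNorm and ReLU, followed by a max over the points of each pillar); channel sizes are illustrative.

```python
# Simplified PointNet pillar encoder: per-point embedding, then max-pool per pillar.
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    def __init__(self, in_channels=9, out_channels=64):
        super().__init__()
        self.linear = nn.Linear(in_channels, out_channels, bias=False)
        self.norm = nn.BatchNorm1d(out_channels)

    def forward(self, pillars):
        # pillars: (num_pillars, max_points, in_channels) decorated point features
        x = self.linear(pillars)                          # per-point embedding
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm over channels
        x = torch.relu(x)
        return x.max(dim=1).values                        # one feature vector per pillar
```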
Optimization
We use the Adam optimizer with a cyclic learning rate decay. The maximum learning rate is 3e-3 and the weight decay is 0.001. We train 60 epochs for the nuScenes dataset and 80 epochs for the Lyft dataset; the batch size is 2 for nuScenes and 1 for Lyft.
4.3 Results
Table 1: Detection results on the nuScenes test set (per-category AP, mAP and NDS; Trail = trailer, CV = construction vehicle, Ped = pedestrian, MC = motorcycle, Bicy = bicycle, TC = traffic cone, Bar = barrier).
Methods | Modality | Car | Truck | Bus | Trail | CV | Ped | MC | Bicy | TC | Bar | mAP | NDS
Mono [29] | RGB | 47.8 | 22.0 | 18.8 | 17.6 | 7.4 | 37.0 | 29.0 | 24.5 | 48.7 | 51.1 | 30.4 | 38.4
Second [36] | Lidar | 73.1 | 25.2 | 30.5 | 31.5 | 8.5 | 59.3 | 21.7 | 4.9 | 18.0 | 43.3 | 31.6 | 46.8
PP [13] | Lidar | 68.4 | 23.0 | 28.2 | 23.4 | 4.1 | 59.7 | 27.4 | 1.1 | 30.8 | 38.9 | 30.5 | 45.3
Painting [31] | Lidar&RGB | 77.9 | 35.8 | 36.1 | 37.3 | 15.8 | 73.3 | 41.5 | 24.1 | 62.4 | 60.2 | 46.4 | 58.1
SSN | Lidar | 80.7 | 37.5 | 39.9 | 43.9 | 14.6 | 72.3 | 43.7 | 20.1 | 54.2 | 56.3 | 46.3 | 56.9
Results on nuScenes dataset. In this experiment, we test our model on the nuScenes dataset and report the performance on the test set from the official evaluation server. The results are shown in Table 1, with the detailed AP of each category and the other metrics. SSN achieves about 15% improvement in mAP and 10% in NDS compared to the lidar-based methods, even for some small objects such as pedestrian and traffic cone. Even compared with the Lidar&RGB fusion method [31], our lidar-only model achieves comparable performance and performs better on the main categories of traffic scenarios, such as Car, Truck, Bus and Motorcycle. Note that the results of PointPillars and Painting [31] are copied from the original papers, and for Second, we re-implement it under our setting with the same hyper-parameters as SSN. For bicycle, due to its sparsity and low height, it is difficult to identify in the point cloud while it is clearly visible in the image, so the bicycle result of image-based detection is better than that of 3D detection.
Results on Lyft dataset. For the Lyft dataset, there is no official split of training and validation sets. Hence, we report the results of the Kaggle competition (30% of the test data is used for public validation, but the host does not provide the ground truth; we submit the outputs of SSN and our baseline models to obtain the results). As Lyft is a very new dataset with no official implementations, we re-implement PointPillars and Second for it; the optimization method and anchor matching strategy follow SSN. Table 2 shows the results of SSN and the other existing methods on the test set. SSN consistently achieves better performance, with about 5% improvement compared to existing methods. Due to the strict metric (mAP3D under IoU 0.5 to 0.95), the results on the Lyft dataset are lower than those on nuScenes. Note that we only report the results of a single model with single-scale training; the result on the official website is 18.1%, which uses multi-scale training.
Qualitative Analysis. We show several samples from the challenging Lyft dataset in Fig. 4. For ease of interpretation, we show the 3D boxes from the BEV perspective. It can be found that car, bus and other vehicles achieve decent performance. Some false positives and missed objects appear at far range (about 50m).
t-SNE Visualization. We use t-SNE to visualize the distribution of shape signatures in Fig. 5. Four categories in nuScenes, including Car, Truck, Motorcycle and Pedestrian, are sampled for a clear visual effect. We sample 50 instances for each category, where 25 of them lie within 40 meters and the others lie beyond 40 meters. It can be observed that the discrepancy across different classes is clear, which indicates the capability of our shape signature to separate the shape distributions across different categories. Meanwhile, the distribution of shape signatures within the same class differs with distance (points within 40m and points beyond 40m cluster in different regions), which demonstrates that the shape signature acts as a soft (not hard) constraint and keeps the shape distribution consistent (not identical).
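A sketch of this visualization using scikit-learn's t-SNE; the sampling and plotting details are illustrative.

```python
# Embed the 9-d shape signatures into 2-D with t-SNE and color them by category.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_signatures(signatures, labels):
    """signatures: (N, 9) shape vectors; labels: (N,) category names."""
    emb = TSNE(n_components=2, perplexity=20, init="pca").fit_transform(signatures)
    for cat in np.unique(labels):
        m = labels == cat
        plt.scatter(emb[m, 0], emb[m, 1], s=8, label=str(cat))
    plt.legend()
    plt.show()
```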
4.4 Ablation Studies
In this section, we perform thorough ablation experiments to investigate the effect of the different components in our method, including the shape-aware grouping heads and the shape signature, the scalability of the proposed shape signature with various backbone networks, and a comparison with another shape signature.
Effect of Different Components. In this experiment, we choose PointPillars as the backbone and perform the ablation study by adding the components step by step. Due to the limited number of submissions allowed by the evaluation server, we report the results on the official validation set of the nuScenes dataset. As shown in Table 3, the two key components, the shape-aware grouping heads and the shape signature, achieve significant performance gains, with 6.4% and 5.7% improvements in NDS respectively, which demonstrates that the shape information does improve multi-class detection.
Scalability of Shape Signature. To investigate the scalability of the proposed shape signature, we perform a thorough study in which the shape signature is combined with different backbone networks and tested on different datasets. The detailed results are shown in Table 4. As backbone networks, we use PointPillars and Second, which utilize 2D and 3D convolutional networks respectively and cover the mainstream of 3D object detection. The shape signature greatly improves the performance of both backbone networks, and it achieves consistent performance gains across different datasets, i.e. the nuScenes, Lyft and Kitti [7] datasets. Note that the mAP3D metric of Lyft is similar to that of the COCO dataset, which is much more difficult than the mAP in nuScenes and Kitti. From these two perspectives, we can see that the proposed shape signature does possess good scalability and that the exploration of shape information does improve the capability of detection networks to discriminate multiple categories.
Table 4: Scalability of the shape signature (SS) with different backbones on the nuScenes, Lyft and Kitti datasets.
Methods | nuScenes mAP | nuScenes NDS | Lyft mAP3D | Kitti mAP@car | Kitti mAP@ped
PP [13] | 29.4 | 44.9 | 13.4 | 74.3 | 41.9
PP + SS | 36.6 | 49.8 | 16.2 | 76.2 | 43.5
Second [36] | 31.1 | 46.9 | 13.0 | 73.7 | 42.6
Second + SS | 34.3 | 48.9 | 15.4 | 75.4 | 44.1
Table 5: Comparison of different head designs and shape encodings built on PointPillars (PP); OtoO = one-to-one heads, SG = shape-aware grouping heads, IS = implicit shape signature [9], SS = our shape signature.
Methods | PP [13] | PP + OtoO Heads | PP + SG Heads | PP + IS [9] | PP + SS
mAP | 29.4 | 32.0 | 39.1 | 31.4 | 36.6
NDS | 44.9 | 46.2 | 51.0 | 46.7 | 49.8
Shape-aware Grouping Heads vs. One-to-One Heads. To verify the effectiveness of the shape-aware grouping heads, we compare them to one-to-one heads, in which each head covers one category. The difference between the two types of heads is whether shape information is exploited. From the results shown in Table 5, the shape-aware grouping heads perform much better than the one-to-one heads in both metrics, which further demonstrates that shape information benefits multi-class discrimination. Moreover, the shape grouping strategy, which groups objects with similar shape and scale to aid the exploration of shape information, is also more effective than the one-to-one strategy.
Comparison with another Shape Signature. The previous work [9] provides an implicit shape representation for instance segmentation. We adapt this approach to point clouds and obtain an implicit shape signature with the same dimension (denoted "IS"). We compare "IS" with our shape signature ("SS") in Table 5. Our shape signature outperforms the implicit shape signature by a large margin, because "SS" better handles the difficulties of point clouds through completion and robustness enhancement.
Dimension of Shape Signature. We use the top 3 coefficients of the Chebyshev approximation because they effectively cover the principal part of the shape function. For example, for the bird-view shape vector of a car (showing the full coefficients), [1.93, 0.65, 0.083, 4.68e-03, 1.064e-05, ...], it can be seen that the top 3 coefficients contain the main information and are appropriate as an objective.
5 Conclusion
In this paper, we design a novel shape signature which acts as a soft constraint and thus aids the feature capability of multi-class discrimination. It carries two appealing properties, i.e. it is compact and effective as an objective, and robust against sparsity and noise. Based on the proposed shape signature, we develop the shape signature networks for object detection from point clouds, which make use of shape information to promote multi-class detection through the shape-aware grouping heads and the shape signature objective. We conduct extensive experiments and ablation studies, which demonstrate that our model achieves state-of-the-art performance and that the proposed shape signature shows good scalability across various backbones.
References
 [1] Lyft Level 5 Dataset: https://level5.lyft.com/dataset/
 [2] Belongie, S., Malik, J., Puzicha, J.: Shape context: A new descriptor for shape matching and object recognition. In: Advances in neural information processing systems. pp. 831–837 (2001)
 [3] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027 (2019)

 [4] Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3d object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2147–2156 (2016)
 [5] Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multiview 3d object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1907–1915 (2017)
 [6] Frome, A., Huber, D., Kolluri, R., Bülow, T., Malik, J.: Recognizing objects in range data using regional point descriptors. In: European conference on computer vision. pp. 224–237. Springer (2004)
 [7] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32, 1231 – 1237 (2013)
 [8] Golovinskiy, A., Kim, V.G., Funkhouser, T.: Shapebased recognition of 3d point clouds in urban environments. In: 2009 IEEE 12th International Conference on Computer Vision. pp. 2154–2161. IEEE (2009)
 [9] Jetley, S., Sapienza, M., Golodetz, S., Torr, P.H.S.: Straight to shapes: Realtime detection of encoded shapes. CVPR pp. 4207–4216 (2016)
 [10] Johnson, A.E., Hebert, M.: Surface matching for object recognition in complex three-dimensional scenes. Image and Vision Computing 16(9-10), 635–651 (1998)
 [11] Kasaei, S.H., Tomé, A.M., Lopes, L.S., Oliveira, M.: Good: A global orthographic object descriptor for 3d object recognition and manipulation. Pattern Recognition Letters 83, 312–320 (2016)
 [12] Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d proposal generation and object detection from view aggregation. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1–8. IEEE (2018)
 [13] Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12697–12705 (2019)
 [14] Li, P., Chen, X., Shen, S.: Stereo rcnn based 3d object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7644–7652 (2019)
 [15] Liang, M., Yang, B., Chen, Y., Hu, R., Urtasun, R.: Multitask multisensor fusion for 3d object detection. CVPR pp. 7337–7345 (2019)
 [16] Liang, M., Yang, B., Wang, S., Urtasun, R.: Deep continuous fusion for multisensor 3d object detection. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 641–656 (2018)
 [17] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017)
 [18] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
 [19] Liu, L., Lu, J., Xu, C., Tian, Q., Zhou, J.: Deep fitting degree scoring network for monocular 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1057–1066 (2019)
 [20] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European conference on computer vision. pp. 21–37. Springer (2016)
 [21] Marton, Z.C., Pangercic, D., Blodow, N., Beetz, M.: Combined 2d–3d categorization and classification for multimodal perception systems. The International Journal of Robotics Research 30(11), 1378–1402 (2011)

 [22] Pham, C.C., Jeon, J.W.: Robust object proposals re-ranking for object detection in autonomous driving using convolutional neural networks. Signal Processing: Image Communication 53, 110–122 (2017)
 [23] Chebyshev polynomials: https://en.wikipedia.org/wiki/Chebyshev_polynomials
 [24] Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep hough voting for 3d object detection in point clouds. arXiv preprint arXiv:1904.09664 (2019)
 [25] Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3d object detection from rgbd data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 918–927 (2018)

 [26] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 652–660 (2017)
 [27] Rusu, R.B., Bradski, G., Thibaux, R., Hsu, J.: Fast 3d recognition and pose using the viewpoint feature histogram. In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 2155–2162. IEEE (2010)
 [28] Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–779 (2019)
 [29] Simonelli, A., Bulò, S.R.R., Porzi, L., LópezAntequera, M., Kontschieder, P.: Disentangling monocular 3d object detection. arXiv preprint arXiv:1905.12365 (2019)
 [30] Wang, T., Zhu, X., Lin, D.: Reconfigurable voxels: A new representation for lidar-based point clouds. arXiv preprint (2020)
 [31] Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: Sequential fusion for 3d object detection. ArXiv abs/1911.10150 (2019)
 [32] Wang, Y., Chao, W.L., Garg, D., Hariharan, B., Campbell, M., Weinberger, K.Q.: Pseudolidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8445–8453 (2019)
 [33] Xu, B., Chen, Z.: Multilevel fusion based 3d object detection from monocular images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2345–2353 (2018)
 [34] Xu, D., Anguelov, D., Jain, A.: Pointfusion: Deep sensor fusion for 3d bounding box estimation. CVPR pp. 244–253 (2017)
 [35] Xu, Y., Zhu, X., Shi, J., Zhang, G., Bao, H., Li, H.: Depth completion from sparse lidar data with depthnormal constraints. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2811–2820 (2019)
 [36] Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
 [37] Yang, B., Liang, M., Urtasun, R.: Hdnet: Exploiting hd maps for 3d object detection. In: Conference on Robot Learning. pp. 146–155 (2018)
 [38] Yang, B., Luo, W., Urtasun, R.: Pixor: Realtime 3d object detection from point clouds. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 7652–7660 (2018)
 [39] Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: Std: Sparsetodense 3d object detector for point cloud. arXiv preprint arXiv:1907.10471 (2019)
 [40] Zhou, Y., Tuzel, O.: Voxelnet: Endtoend learning for point cloud based 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4490–4499 (2018)
 [41] Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Classbalanced grouping and sampling for point cloud 3d object detection. ArXiv abs/1908.09492 (2019)