1 Introduction
Self-driving vehicles, also known as autonomous vehicles (AVs), are poised to reshape transportation systems. As the “visual cortex” of an autonomous vehicle, 3D detection algorithms have gained attention from both the research and industrial communities. By representing spatial data as a collection of coordinates, i.e. geographic point clouds, we can construct a three-dimensional representation analogous to the real world. LiDAR (Light Detection And Ranging) laser scanners are the most common instruments used to collect these geographic point clouds. Owing to self-occlusion, occlusion, reflection, undesirable weather conditions, etc., LiDAR point clouds tend to be sparse and irregular. Due to this sparsity and irregularity, effective processing algorithms remain an open problem. Meanwhile, the cost of LiDAR is another barrier to its massive adoption in self-driving cars. It is therefore desirable for 3D detection algorithms to work with low-beam-density LiDAR point clouds, which is a big challenge for current algorithms as well.
To tackle the existing challenges of LiDAR point clouds, most existing methods strive to organize points on individual objects together, and define object-level anchors that predict offsets of 3D bounding boxes using collective evidence from all the points on the objects of interest. VoxelNet [39] proposes a voxel feature encoder made of an MLP that aggregates sampled point features in a voxel, and extracts voxel features with a standard fully convolutional network (FCN) consisting of 3D convolutions and a Region Proposal Network (RPN). Voxel-based approaches such as VoxelNet [39] and its successors, SECOND [33] and PointPillar [14], require hyperparameters such as anchor ranges, anchor sizes and orientations to set anchors, as well as IoU matching to assign ground truths. However, predefined anchors require prior knowledge about the statistical size and orientation of objects. Additionally, in common practice the number of anchors grows linearly with the number of object classes, which introduces extra burden. Furthermore, IoU matching for ground-truth assignment decreases the number of positive samples, which aggravates the imbalance between negative and positive samples.
Do we really need predefined object-level anchors to infer the 3D location and orientation of an object? According to our observations, the dense 3D shape of objects is never captured by LiDAR point clouds; in some extreme cases, fewer than 10 points from one object are sensed. We thus argue in this paper for an approach opposite to existing methods using object-level anchors. Inspired by compositional part-based models [10, 40, 6, 4, 12], which have been shown to be robust when classifying partially occluded 2D objects and when detecting partially occluded object parts [37], we propose to detect objects in LiDAR point clouds by representing them as small cliques of points. We take these cliques of points as hotspots, which will be described and discussed in Sec. 3.2. We adopt voxelization, which partitions points into cliques, as our input representation. At the same time, we propose a new object-as-hotspots (OHS) module to process the voxels and generate 3D information. Based on OHS, we propose a Hotspot Network (HotSpotNet) that performs 3D object detection via the firing of hotspots, without predefined bounding boxes. More specifically, we first partition the 3D space into regular voxels, and aggregate the features of individual points inside each voxel to form the hotspot representation; these hotspots are trained to directly predict objects’ location and orientation; final results are obtained by applying non-maximum suppression to the predictions of individual hotspots. The simplified concept of HotSpotNet and a visualization of hotspots in BEV are shown in Fig. 1. The main contributions of our proposed method can be summarized as follows:

We propose a novel representation of point clouds, termed Object as HotSpots (OHS), to effectively handle data with sparsity and irregularity, e.g. LiDAR point clouds. We also propose two hotspot assignment methods and evaluate their corresponding performance.

Based on OHS, we design the HotSpot Network (HotSpotNet) for LiDAR point clouds; to our knowledge, this is the first one-stage, anchor-free 3D detection method that achieves state-of-the-art performance.

We propose a combined training loss to deal with the large variance of 3D bounding box regression without the constraint of anchors, and to train HotSpotNet accurately and effectively. In particular, a quadrant classification loss is proposed to learn the inherent and invariant object-part relation, which enables HotSpotNet to obtain 3D information accurately and efficiently.

Benchmarks on KITTI show the effectiveness of our proposed method. In particular, on pedestrian detection our method beats all existing methods, ranking 1st on the KITTI test dataset. Meanwhile, extensive experimental results on pseudo 32-beam LiDAR point clouds further verify the efficacy of our algorithm in handling sparse data.
2 Related Work
Currently, there are four types of point cloud input for 3D detection algorithms. 1) Raw point clouds [27, 36]. 2) Projection [34, 28, 1, 5, 21]: usually, these methods project the 3D LiDAR point clouds into bird’s eye views (BEV) or range views. 3) Voxelization [33, 39, 14]: generally, these methods divide the raw point clouds into regular grids and feed the grids to the algorithms to infer 3D information. 4) Mixture of representations [3, 19]: these methods fuse raw point clouds and voxels at different stages of the networks. Different algorithms may consume different types of input; in this paper, we adopt the voxelization representation.
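As a minimal illustration of the voxelization representation (a sketch under our own assumptions, not the exact pipeline of any cited method), the following snippet quantizes points into a regular grid and groups them by voxel index:

```python
import numpy as np

def voxelize(points, voxel_size, range_min):
    """Assign each point of an (N, 3) cloud to a voxel index and group points per voxel."""
    # Integer voxel coordinates of every point.
    coords = np.floor((points - range_min) / voxel_size).astype(np.int64)
    voxels = {}
    for coord, point in zip(map(tuple, coords), points):
        voxels.setdefault(coord, []).append(point)
    return voxels

# Two points fall into the same 0.2 m voxel, one into a different voxel.
pts = np.array([[0.05, 0.05, 0.0], [0.15, 0.1, 0.1], [1.0, 1.0, 1.0]])
grid = voxelize(pts, voxel_size=0.2, range_min=0.0)
```

A per-voxel feature (e.g. the point mean, as in the simplified VFE used later) can then be computed from each group.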
Below, we briefly review related work on one-stage and two-stage 3D object detection, and then we emphasize the possibly related anchor-free 3D object detection.
One-Stage 3D Detection Similar to one-stage 2D detectors, one-stage 3D detectors process the input data once to obtain the 3D bounding boxes. One-stage methods typically adopt a sliding-window mechanism; therefore, constructing a contiguous and regular feature representation is important. VoxelNet [39] and SECOND [33] use a 3D voxel representation and a 3D fully convolutional network as the backbone. PointPillar [14] carefully designs a pillar shape to organize the point cloud into a 2D pseudo-image in order to avoid 3D convolution. For these methods, FPN [17] is a commonly used network architecture because it contains rich semantics from all levels. [34, 16] concatenate the channels from a voxelized BEV and the height information of point clouds to form a new input representation from which they obtain 3D information. [28, 1, 5] extend the one-stage 2D detector YOLO [25] to higher-dimensional detection for point clouds.
Two-Stage 3D Detection Different from one-stage methods, two-stage methods first have a proposal stage to generate plausible candidate regions. MV3D [2] projects LiDAR point clouds to BEV, and then employs a Faster R-CNN [26] on this BEV to obtain 3D object cues. AVOD [13] extends MV3D by aggregating multimodal features to generate more reliable proposals. These methods treat projected point clouds in a similar fashion to RGB images and use traditional 2D detection pipelines in a multimodal setting. [27, 36] generate proposals for each foreground point in the entire space after semantic segmentation of the point clouds. [23, 32] leverage mature 2D image detectors to provide frustum proposals. [3] adopts a one-stage VoxelRPN to provide initial predictions. Most two-stage methods rely heavily on the design of predefined anchors.
Because of the complex design and low FPS (frames per second) of two-stage methods, our algorithm is designed as a one-stage 3D detector, but our design is applicable to two-stage detection as well.
Anchor-Free 3D Detection To our knowledge, there are no existing anchor-free 3D detectors for LiDAR-based point clouds. Some algorithms without anchor regression have been proposed for indoor scenes. SGPN [31] segments instances via semantic segmentation and learns a similarity matrix to group points together. This method is not scalable, since the size of the similarity matrix grows quadratically with the number of points. 3D-BoNet [35] learns bounding boxes to provide a boundary for points from different instances. Unfortunately, both methods fail when only partial point clouds are observed, which is common in LiDAR point clouds. VoteNet [22] generates seed points from PointNet++ [24], independently generates votes through a shared MLP voting module, and then refines the clustered points to obtain box proposals. Similar to us, they believe each point in the point cloud contributes to the 3D geometry reconstruction. Differently, they use this information to generate the centroid, while we directly regress the geometric information, i.e. the 3D bounding boxes. Though it has some anchor-free flavor, VoteNet is not strictly anchor-free because it uses anchor boxes for size regression, similar to PointRCNN [27].
3 Object as Hotspots
3.1 Interpretation of Hotspots
Part-based models are known to be robust to occlusion. When some parts are occluded, the remaining parts can still provide hints about the semantics of objects. Likewise, we represent LiDAR-scanned objects as a composition of multiple hotspots. Each hotspot individually predicts its likelihood of object presence so that, when an object is mostly occluded or only partially scanned, some hotspots are still able to indicate the presence of the object and contribute to the 3D geometry information. Intuitively, hotspots are small cliques of LiDAR points captured on objects. For the voxelization representation, a voxel with several points enclosed is a sample of a hotspot. A neuron on the CNN feature maps can represent a collection of voxels whose locations project to the corresponding location of the neuron on the feature map; a neuron therefore has more representational power than a single input voxel. During training, neurons on the feature map are assigned as hotspots or non-hotspots for each object category and trained with a binary classifier. At inference, a neuron on a feature map is fired as a hotspot if it gives high confidence of being part of an object.
3.2 Hotspot Assignment
In the 3D world, objects are usually rigid and do not overlap with each other. Therefore, each annotated ground-truth bounding box should contain the point cloud of only one object. We can take advantage of the annotated bounding box to determine the hotspots that lie on the object. Given an object instance in the point cloud, the annotation defines a bounding box (c, x, y, z, l, w, h, θ) to indicate the object location, where c is the category index, (x, y, z) is the center of the box, (l, w, h) are the dimensions of the box, and θ is the rotation angle around the z-axis in radians in LiDAR coordinates.
Due to labeling error, boundary points could lie in a confusing area between object and background; such points cannot reliably contribute to the final regression. Consequently, for each ground-truth box b we define an effective box b_e, shrunk from b by a ratio r_e, so that points within the effective box are all high-confidence hotspots. We also define an ignoring box b_i, scaled by a ratio r_i with r_e < r_i, outside the effective box to serve as a soft boundary between object and background; r_e and r_i are the ratios controlling the effective region and the ignoring region. Points outside the effective box but inside the ignoring box do not contribute to backpropagation during training. Points outside the ignoring box are all non-hotspots.
To decide whether a neuron on the feature map represents hotspot(s) or not, we have two alternative methods. The direct way is to project the collection of non-empty voxels within the effective bounding boxes to the corresponding locations of neurons on the feature map. Alternatively, we can project all voxels within the effective bounding boxes, including both empty and non-empty voxels, from velodyne coordinates to the feature map. In this assignment, we assume that empty voxels also help acquire 3D geometry information and can be trained as hotspots as well. We term the two variants of our network, according to whether the hotspot assignment projects only non-empty voxels or all voxels, HotSpotNet-Direct and HotSpotNet-Dense, respectively.
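For concreteness, the three-way labeling induced by the effective and ignoring boxes can be sketched as follows (an illustrative simplification with axis-aligned BEV boxes and a hypothetical helper; the actual boxes are rotated):

```python
import numpy as np

def assign_hotspots(voxel_centers, center, dims, r_e, r_i):
    """Label BEV voxel centers: 1 = hotspot, -1 = ignored, 0 = non-hotspot.
    Axis-aligned simplification of the effective/ignoring boxes."""
    offset = np.abs(voxel_centers - center)   # (N, 2) distances to the box center
    half_eff = r_e * dims / 2.0               # half-extent of the effective box
    half_ign = r_i * dims / 2.0               # half-extent of the ignoring box
    in_eff = np.all(offset <= half_eff, axis=1)
    in_ign = np.all(offset <= half_ign, axis=1)
    labels = np.zeros(len(voxel_centers), dtype=np.int64)
    labels[in_ign & ~in_eff] = -1             # soft boundary: no gradient
    labels[in_eff] = 1                        # high-confidence hotspots
    return labels

centers = np.array([[0.0, 0.0], [0.9, 0.0], [3.0, 3.0]])
labels = assign_hotspots(centers, center=np.array([0.0, 0.0]),
                         dims=np.array([2.0, 2.0]), r_e=0.8, r_i=1.0)
```

The three example voxel centers land in the effective region, the soft boundary, and the background, respectively.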
4 HotSpot Network
Hotspot Network (HotSpotNet) consists of a 3D feature extractor and an Object-as-Hotspots head. The OHS head has three subnets for hotspot classification, box regression and quadrant classification. Though we use a voxel-based CNN backbone here, our OHS head can be stacked on top of any LiDAR point cloud based method, including raw point cloud based methods, e.g. PointNet++ [24].
The whole architecture of our proposed HotSpotNet is shown in Fig. 2. The input LiDAR point clouds are voxelized into regular grids, which pass through the 3D CNN to generate the feature maps. The three subnets guide the supervision and generate the predicted 3D bounding boxes. Hotspot assignment happens at the last convolutional feature maps of the backbone. The details of the network architecture and the three subnets for supervision are described below.
4.1 ObjectasHotspots Head
Our OHS head consists of three subnetworks: 1) a hotspot classification subnetwork that predicts the probability of each hotspot being part of an object category; 2) a box regression subnetwork that regresses the center locations, dimensions and orientations of the 3D boxes; and 3) a quadrant classification subnetwork that predicts which quadrant each hotspot falls into with regard to the local object coordinate frame with origin at the object center.
4.2 Hotspot Losses for Supervision
Hotspot Classification The classification module is a convolutional layer producing one heatmap per category. Hotspots are labeled as one; locations falling in the ignoring region are ignored and do not contribute to backpropagation; the targets for all non-hotspots are zero. Binary classification is applied to hotspots and non-hotspots, with focal loss [18] applied at the end,

(1)  FL(p̂) = −α(1 − p̂)^γ log(p̂)

where p̂ = p if the location is a hotspot and p̂ = 1 − p otherwise, p is the output probability, and C is the number of categories. The total classification loss of one instance is the summation of the focal loss over all effective and negative regions, normalized by the total number of hotspot and non-hotspot units (excluding ones in the ignoring region).
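A per-location sketch of this loss, following the common focal-loss formulation of [18] (the α-balancing of the negative class is our assumption; the exact α, γ values are set in Sec. 5.2):

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss per location; target is 1 for hotspots, 0 for non-hotspots."""
    p_t = np.where(target == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)  # class-balancing weight
    # The (1 - p_t)^gamma factor down-weights easy, well-classified locations.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))
```

Confident correct predictions contribute almost nothing, so the many easy non-hotspot locations do not dominate training.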
Box Regression
The bounding box regression only happens on the active hotspot features. For each hotspot feature, an eight-dimensional vector (Δx, Δy, z, log l, log w, log h, sin θ, cos θ) is regressed to represent the object instance in the LiDAR point cloud. (Δx, Δy) are the deviations of the hotspot (colored red) on the feature map from the instance centroid, illustrated in Fig. 3. The voxel centroid in velodyne coordinates in BEV can be obtained by:

(2)  x_c = x_min + (u + 0.5)·s·d_x,  y_c = y_min + (v + 0.5)·s·d_y

where (u, v) is the spatial index of a hotspot unit on the feature map, (x_c, y_c) is the corresponding centroid in velodyne coordinates based on which we perform regression, s is the total downsampling stride, (d_x, d_y) is the voxel size, and [x_min, x_max], [y_min, y_max] are the ranges for x and y when we voxelize all the points.
Unlike anchor-based methods, where predefined normalization factors, i.e., anchor box dimensions, regularize the training targets and stabilize training, our HotSpotNet can easily suffer training imbalance without such factors, due to the scale variance of object dimensions (both inter- and intra-class). 2D anchor-free detectors often utilize FPN [17] to alleviate scale variance. Instead of introducing extra layers and computational overhead, e.g. adding FPN, to our network, we tackle scale variance by carefully designing the targets. We regress log l, log w, log h instead of the original values because the logarithm scales down the absolute values. We regress sin θ, cos θ instead of θ directly because they strictly constrain the rotation to a unique solution. Moreover, we use the soft argmin proposed in [11] to regress Δx, Δy and z. To regress a location in a segment ranging from s_min to s_max by soft argmin, we divide the segment into K bins, each bin accounting for a length of (s_max − s_min)/K. The target location can then be represented as Σ_k p_k·c_k, where p_k is the softmax score of the k-th bin and c_k is the center location of the k-th bin. Soft argmin turns the regression into a classification problem and avoids regressing raw values directly. We find the choice of K does not affect the performance of our approach as long as the bins cover the range of target values.
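The soft argmin decoding described above can be sketched as a softmax-weighted expectation over bin centers (symbols as in the text; an illustrative implementation, not the paper's code):

```python
import numpy as np

def soft_argmin(logits, s_min, s_max):
    """Decode a continuous value from per-bin logits over the segment [s_min, s_max]."""
    k = len(logits)
    width = (s_max - s_min) / k
    centers = s_min + width * (np.arange(k) + 0.5)  # bin center locations c_k
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                          # softmax scores p_k over bins
    return float((scores * centers).sum())          # expectation = soft argmin
```

A near one-hot distribution recovers that bin's center, while uniform scores give the segment midpoint, so the decoded value varies smoothly with the logits.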
Smooth L1 loss [8] is adopted for regressing these bounding box targets:

(3)  SmoothL1(x) = 0.5x², if |x| < 1;  |x| − 0.5, otherwise

We only compute the regression loss for locations over all effective box regions.
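Eq. 3 in code form, for reference:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: quadratic near zero, linear for large residuals."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)
```

The quadratic region keeps gradients small near the optimum, while the linear region limits the influence of outlier residuals.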
Quadrant Classification Our HotSpotNet predicts the axis-aligned deviations (Δx, Δy) from hotspots to object centroids. This encoding does not capture the inherent relation between hotspots and object centroids, since the deviations vary with object orientation. We want our model to learn the inherent and invariant object-part relation, so we introduce another supervision signal for a coarse localization of the hotspots. For each hotspot within the effective box regions, we categorize the relative hotspot location with respect to the object center (in BEV) into quadrants, as shown in Fig. 3. We find quadrant classification helps our HotSpotNet converge faster. We train our quadrant classification subnetwork with a binary cross-entropy loss, computed only for hotspots:

(4)  L_quadrant = −Σ_q [ y_q log p_q + (1 − y_q) log(1 − p_q) ]

where q indexes the quadrant, y_q is the target, and p_q is the predicted likelihood of falling into the specific quadrant.
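The quadrant target can be computed by rotating the global deviation into the object's local frame and reading off the sign pair (a sketch; the rotation direction and the bit encoding of quadrants are our assumptions):

```python
import numpy as np

def quadrant_label(dx, dy, theta):
    """Quadrant index (0..3) of a hotspot relative to the object center,
    measured in the object's local frame (rotate the global deviation by -theta)."""
    local_x = np.cos(theta) * dx + np.sin(theta) * dy
    local_y = -np.sin(theta) * dx + np.cos(theta) * dy
    return int((local_x < 0) * 2 + (local_y < 0) * 1)  # bit-encode the sign pair
```

Because the deviation is expressed in the object frame, the label is invariant to the object's orientation, which is exactly the inherent object-part relation the loss is meant to teach.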
4.3 Learning and Inference
The final loss for our proposed HotSpotNet is the weighted sum of losses from three branches:
(5)  L = λ_cls·L_cls + λ_box·L_box + λ_quadrant·L_quadrant

where λ_cls, λ_box and λ_quadrant are the weights balancing the classification, box regression and quadrant classification losses.
During inference, if the largest entry of the C-dimensional vector of the classification heatmaps at a location is above a threshold, we consider that location a hotspot firing for the corresponding object category. For hotspots, it is straightforward to decode the associated predicted boxes from (Δx, Δy, z, log l, log w, log h, sin θ, cos θ) to the canonical representation (x, y, z, l, w, h, θ). Since one instance might have multiple hotspots, we further use NMS with an Intersection over Union (IoU) threshold to pick the most confident hotspot for each object. The quadrant classification branch does not contribute to inference.
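The decoding step can be sketched as follows (symbols as above; x_c, y_c are the voxel-centroid terms of Eq. 2, and the target layout is as reconstructed in Sec. 4.2):

```python
import numpy as np

def decode_box(reg, x_c, y_c):
    """Decode one regressed vector back to a canonical 3D box.
    reg = (dx, dy, z, log_l, log_w, log_h, sin_t, cos_t)."""
    dx, dy, z, log_l, log_w, log_h, sin_t, cos_t = reg
    x = x_c + dx                       # hotspot centroid plus predicted deviation
    y = y_c + dy
    l, w, h = np.exp([log_l, log_w, log_h])
    theta = np.arctan2(sin_t, cos_t)   # unique angle from the (sin, cos) pair
    return x, y, z, l, w, h, theta

box = decode_box([1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0], x_c=10.0, y_c=5.0)
```

Rotated NMS over the decoded boxes then keeps the most confident hotspot per instance.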
5 Experiments
In this section, we summarize the dataset in Sec. 5.1 and present the implementation details of our proposed HotSpotNet in Sec. 5.2. In Sec. 5.3 we evaluate our method on the challenging 3D detection benchmark KITTI [7]. We also compare our anchor-free approach with the baseline on the pseudo 32-beam KITTI dataset in Sec. 5.4 and present ablation studies in Sec. 5.5. In Sec. 5.6, we illustrate visualization results.
5.1 Dataset and Evaluation
Dataset KITTI has 7,481 annotated LiDAR point clouds for training with 3D bounding boxes for object classes such as cars, pedestrians and cyclists. Additionally, KITTI provides 7,518 LiDAR point clouds without labels for testing. In the rest of the paper, unless explicitly noted, all experiments run on the common train/val split, i.e. 3,712 LiDAR point clouds for training and 3,769 for validation, and performance is reported on the validation data. To further compare with other approaches on the KITTI 3D detection benchmark, we randomly split the KITTI training data into training and validation subsets and report performance on the KITTI test dataset.
To further verify that our proposed method can better tackle the natural sparsity of point clouds, we generate pseudo 32-beam KITTI data downsampled from the original 64-beam KITTI. We observe that the point elevations are densely distributed within a middle elevation range and sparsely distributed near the upper and lower elevation limits. In consequence, we uniformly divide these three elevation ranges into 4, 56 and 4 bins, respectively, and sample points from 32 bins with stride 2 to simulate 32-beam point clouds, resulting in two sets of pseudo 32-beam KITTI datasets. A visual comparison of 64-beam and pseudo 32-beam LiDAR point clouds is shown in Fig. 4. We report average performance on the two pseudo 32-beam KITTI datasets.
Metric
As in prior work, average precision (AP) is used to evaluate our method. We follow the official KITTI evaluation protocol, i.e., the IoU threshold is 0.7 for cars and 0.5 for pedestrians and cyclists. Precision and recall curves are computed using 40 recall points instead of 11.
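A simplified sketch of N-point interpolated AP from a precision-recall curve (the official KITTI evaluation differs in details such as the exact recall thresholds and difficulty filtering):

```python
import numpy as np

def interpolated_ap(recall, precision, n_points=40):
    """Average the interpolated precision at n_points equally spaced recall levels."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, n_points):
        mask = recall >= r
        # Interpolated precision: max precision at recall >= r (0 if unreached).
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / n_points

# A perfect detector keeps precision 1.0 at every recall level.
ap = interpolated_ap(np.array([0.0, 0.5, 1.0]), np.array([1.0, 1.0, 1.0]))
```

More recall points make the metric less sensitive to the exact shape of the curve between samples.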
Method  Input  Stage  3D Detection (Car)  3D Detection (Cyclist)  3D Detection (Pedestrian)  

Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
PIXOR[34]  L  One                   
ComplexYOLO[28]  L  One  47.34  55.93  42.60  18.53  24.27  17.31  13.96  17.60  12.70 
VoxelNet[39]  L  One  65.11  77.47  57.73  48.36  61.22  44.37  39.48  33.69  31.51 
SECONDV1.5[33]  L  One  75.96  84.65  68.71             
HRSECOND[33]  L  One  75.32  84.78  68.70  60.82  75.83  53.67  35.52  45.31  33.14 
PointPillar[14]  L  One  74.31  82.58  68.99  58.65  77.10  51.92  41.92  51.45  38.89 
3D IoU Loss[38]  L  One  76.50  86.16  71.39             
HRIVoxelFPN[30]  L  One  76.70  85.64  69.44             
ContFuse [16]  I + L  One  68.78  83.68  61.67             
MV3D [2]  I + L  Two  63.63  74.97  54.00             
AVODFPN [13]  I + L  Two  71.76  83.07  65.73  50.55  63.76  44.93  42.27  50.46  39.04 
FPointNet [23]  I + L  Two  69.79  82.19  60.59  56.12  72.27  49.01  42.15  50.53  38.08 
MMF [15]  I + L  Two  77.43  88.40  70.22             
PointRCNN [27]  L  Two  75.64  86.96  70.70  58.82  74.96  52.53  39.37  47.98  36.01 
FastPointRCNN[3]  L  Two  77.40  85.29  70.24             
HotSpotNetDense  L  One  78.34  88.12  73.49  62.72  79.09  56.76  39.72  47.14  37.25 
HotSpotNetDirect  L  One  77.74  86.40  72.97  63.16  77.70  57.16  44.81  51.29  41.13 
Method 
Input  Stage  BEV Detection (Car)  BEV Detection (Cyclist)  BEV Detection (Pedestrian)  
Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
PIXOR[34]  L  One  80.01  83.97  74.31             
ComplexYOLO[28]  L  One  68.96  77.24  64.95  25.43  32.00  22.88  18.26  21.42  17.06 
VoxelNet[39]  L  One  79.26  89.35  77.39  54.76  66.70  50.55  46.13  40.74  38.11 
SECONDV1.5[33]  L  One  86.37  91.81  81.04             
HRSECOND[33]  L  One  86.40  91.68  81.40  64.21  78.79  57.82  40.06  50.05  36.47 
PointPillar[14]  L  One  86.56  90.07  82.81  62.73  79.90  55.58  48.64  57.60  45.78 
3D IoU Loss[38]  L  One  86.22  91.36  81.20             
HRIVoxelFPN[30]  L  One  87.21  92.75  79.82             
ContFuse [16]  I + L  One  85.35  94.07  75.88            
MV3D [2]  I + L  Two  78.93  86.62  69.80             
AVODFPN [13]  I + L  Two  84.82  90.99  79.62  57.12  69.39  51.09  50.32  58.49  46.98 
FPointNet [23]  I + L  Two  84.67  91.17  74.77  61.37  77.26  53.78  49.57  57.13  45.48 
MMF [15]  I + L  Two  88.21  93.67  81.99             
PointRCNN [27]  L  Two  87.39  92.13  82.72  67.24  82.56  60.28  46.13  54.77  42.84 
FastPointRCNN[3]  L  Two  87.84  90.87  80.52             
HotSpotNetDense  L  One  88.11  93.73  84.98  66.86  82.13  60.86  44.59  50.87  42.14 
HotSpotNetDirect  L  One  87.95  93.59  83.21  67.20  79.66  61.04  49.48  55.90  45.79 
5.2 Implementation Details
Backbone Network
In all experiments, we adopt a similar backbone network architecture to the open-source implementation of
[33]. The details of our backbone architecture are shown in Fig. 5. We use a simplified version of the Voxel Feature Encoder, i.e. VFE [39], taking the mean of a fixed number of points sampled in each voxel. Our backbone network has a 3D part and a 2D part: the 3D part consists of 3D sparse convolution blocks, and the 2D part of 2D convolution blocks. For the 3D part, we use the convolutions proposed in [9], including sparse convolution and submanifold convolution. We downsample four times in total, except that the last layer downsamples only in the height dimension. For the transformation from 3D to 2D, we collapse the height dimension: for instance, if the height dimension is D and the channel dimension is C, the channel dimension after collapsing is C·D.
Object-as-Hotspots Head Since the output feature map of the backbone collapses to bird's eye view, in this paper we assign hotspots in bird's eye view (though the assignment can also be extended to 3D space). As shown in Fig. 2, our OHS head starts from a shared convolution layer ((c) in Fig. 4). We use a convolution layer followed by a sigmoid to predict confidence for hotspots. For the regression, we apply separate convolution layers to the different regressed values: two convolution layers are stacked to predict the soft argmin for Δx and Δy, another convolution layer predicts the soft argmin for z, an additional convolution layer predicts the dimensions, and another convolution layer the rotation. We use 16 bins for the soft argmin range of Δx, Δy and 16 bins for the range of z. For quadrant classification, we use another convolution layer with softmax for cross-entropy classification.
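The 3D-to-2D collapse described above amounts to folding the height axis into the channel axis; a minimal sketch with toy dimensions:

```python
import numpy as np

# A toy 3D feature volume: (channels C, height D, length H, width W).
C, D, H, W = 4, 2, 8, 8
features_3d = np.random.rand(C, D, H, W)

# Collapse height into channels: (C, D, H, W) -> (C * D, H, W).
features_bev = features_3d.reshape(C * D, H, W)
```

Because the reshape is row-major, channel c at height d of the 3D volume becomes BEV channel c·D + d, so no feature values are mixed or lost.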
We set the focal loss parameters α and γ, and the weights λ_cls, λ_box and λ_quadrant in the weighted sum of losses.
We set the ratios r_e and r_i separately for cars and for pedestrians and cyclists; we set the effective region of pedestrians and cyclists larger to cover more hotspots.
Training and Inference We train the entire network end-to-end with the AdamW [20] optimizer and the one-cycle policy [29] (setting its maximum learning rate, division factor and momentum range), with a fixed weight decay of 0.01. We train the network with batch size 8 for 150 epochs. During testing, we keep 100 proposals after filtering out confidences lower than 0.3, and then apply rotated NMS with an IoU threshold of 0.01 to remove redundant boxes.
Data Augmentation We augment the data using random flipping along one axis in velodyne coordinates, global scaling with a randomly sampled scaling factor, and global rotation around the z-axis by a randomly sampled angle. We also apply ground-truth database sampling [33] on each instance, translating with Gaussian noise in x, y and z and rotating around the z-axis with uniform noise.
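The global augmentations can be sketched as follows (an illustrative helper; the axis choice and the paper's exact sampling ranges are omitted here):

```python
import numpy as np

def augment(points, flip=False, scale=1.0, angle=0.0):
    """Global augmentation of an (N, 3) point cloud: flip y, scale, rotate about z."""
    pts = points.copy()
    if flip:
        pts[:, 1] = -pts[:, 1]             # mirror across the x-z plane
    pts *= scale                            # global scaling
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return pts @ rot.T                      # rotate every point around the z-axis

pts = np.array([[1.0, 0.0, 0.5]])
out = augment(pts, flip=False, scale=2.0, angle=np.pi / 2)
```

Ground-truth boxes must be transformed with the same flip, scale and rotation so that labels stay consistent with the augmented points.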
5.3 Experimental results on KITTI benchmark
As shown in Table 1, we evaluate our method on the 3D detection benchmark and the bird's eye view detection benchmark of the KITTI test dataset. For the 3D object detection benchmark, using only LiDAR point clouds, our proposed HotSpotNet-Direct outperforms all previous peer-reviewed LiDAR-based one-stage detectors on cars, cyclists and pedestrians at all difficulty levels. In particular, HotSpotNet-Direct shows its advantages on small and difficult objects, for example pedestrians and cyclists. When detecting relatively large objects such as cars, the empty voxels, i.e. the dense hotspot assignment, may help; consequently, HotSpotNet-Dense shows some superiority on cars. On the contrary, it introduces noise for small and sparsely scanned objects, and therefore performs worse on cyclists and pedestrians. In the rest of the paper, unless otherwise noted, we adopt HotSpotNet-Direct as our method for quantitative evaluation. Note that all other methods listed in Table 1 are anchor-based, except that PIXOR [34] is an anchor-free 2D detector. These inspiring results show the success of representing objects as hotspots, as well as the potential of anchor-free detectors in 3D detection. Our approach also beats some classic two-stage 3D detectors, even those that fuse LiDAR and RGB image information. Note that our method ranks 1st on pedestrians, outperforming all submitted results on the KITTI test set.
5.4 Experimental results for pseudo 32beam KITTI
To further verify our advantage in handling sparse LiDAR point clouds, we train and validate our proposed method on pseudo 32-beam KITTI and compare with SECOND [33]. As shown in Table 2, HotSpotNet significantly outperforms the baseline on all categories at all difficulty levels. Notably, our method achieves more obvious improvements on the hard level, where objects are usually far away, occluded or truncated, and thus most sparsely captured by LiDAR. This validates our motivation to tackle the sparse nature of LiDAR point clouds.
Method  3D Detection on Car  3D Detection on Cyclist  3D Detection on Pedestrian  

Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
[33]  
Ours  74.08  87.90  70.60  59.23  79.40  55.42  58.81  64.00  52.89 
Method  BEV Detection on Car  BEV Detection on Cyclist  BEV Detection on Pedestrian  
Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
[33]  
Ours  84.90  93.10  81.65  62.36  82.12  58.60  64.48  70.34  58.74 
5.5 Ablation Studies
Effect of quadrant classification To prove the effectiveness of quadrant classification, we show the results of our HotSpotNet with and without quadrant classification on the KITTI validation split for cars in Table 4. We can see that when our algorithm is trained with quadrant classification, the overall performance is boosted; in particular, a large improvement is observed on the hard level. To show that it is our HotSpotNet design that benefits from quadrant classification, we also add quadrant classification to the baseline SECOND [33]. Interestingly, we find quadrant classification impairs the baseline performance. One hypothesis is that anchor-based methods ‘treat an object as a whole’, whereas quadrant classification tries to encode part information, resulting in a conflict. Quadrant classification particularly handles the ambiguities raised by anchor-free algorithms.
Method  3D Detection on Car  3D Detection on Cyclist  3D Detection on Pedestrian  

Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
Ours w/o quadrant  89.48  72.77  
Ours w/ quadrant  82.75  91.87  80.22  72.55  68.08  65.9  60.06  
[33] w/o quadrant  
[33] w/ quadrant  
Method  BEV Detection on Car  BEV Detection on Cyclist  BEV Detection on Pedestrian  
Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
Ours w/o quadrant  88.86  90.63  76.38  
Ours w/ quadrant  89.67  95.88  74.97  70.53  69.28  63.58  
[33] w/o quadrant  
[33] w/ quadrant 
Method  3D Detection on Car  3D Detection on Cyclist  3D Detection on Pedestrian  

Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
Ours w/o directions  
Ours w/ left&right  88.42  
Ours w/ front&back  
Ours w/ quadrant  82.75  80.22  72.55  68.08  72.23  60.06  
Ours w/ 8 directions  92.04  66.26  
Ours w/ deviation regression  
Method  BEV Detection on Car  BEV Detection on Cyclist  BEV Detection on Pedestrian  
Mod  Easy  Hard  Mod  Easy  Hard  Mod  Easy  Hard  
Ours w/o directions  88.86  
Ours w/ left&right  
Ours w/ front&back  
Ours w/ quadrant  89.67  95.88  74.97  90.41  70.53  
Ours w/ 8 directions  70.55  76.46  63.94  
Ours w/ deviation regression 
We proposed quadrant classification as an additional loss to spatially localize the hotspots; this inherent and invariant object-part relation enables HotSpotNet to converge fast. We further investigate the effects of different hotspot-object spatial relation encodings. Besides the quadrant partition presented above, we present four more types of encodings, as shown in Fig. 6. We supervise our network with different spatial encoding targets: 1) classifying the hotspot location into the left or right part of the object; 2) classifying the hotspot location into the front or back of the object; 3) classifying the hotspot location into quadrants of the object; 4) classifying the hotspot location into eight directions of the object; 5) directly regressing the deviation to the object center. The deviation is a pair of values (d_l, d_w) denoting the relative deviations from the center along the box length and width; it ranges within [−0.5, 0.5] because we normalize the values by the box length and width. Thus, (0, 0) is the center of the box and the four corners are (0.5, 0.5), (0.5, −0.5), (−0.5, 0.5) and (−0.5, −0.5). The performance of our approach without any spatial relation encoding is reported as ‘Ours w/o directions’. The performance of integrating the different encodings into our approach is listed in Tab. 4. Generally, too coarse (e.g. two partitions, left-right or front-back) or too sophisticated (e.g. eight directions) hotspot-object relations do not help the regression. The results show that only the quadrant partition significantly improves performance.
Interestingly, adding the additional deviation regression conversely degrades performance. Analyzing the reasons behind this, we find that the deviation together with the orientation is sufficient for global center deviation regression, i.e., obtaining $(\Delta X, \Delta Y)$. Denote the deviation along the box dimensions as $(\Delta x, \Delta y)$; the relation between the global center deviation $(\Delta X, \Delta Y)$ and $(\Delta x, \Delta y)$ is represented by Eq. 6:

$$\begin{pmatrix} \Delta X \\ \Delta Y \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} l\,\Delta x \\ w\,\Delta y \end{pmatrix} \qquad (6)$$

where $\theta$ is the rotation around the $z$ axis (clockwise), and $l$ and $w$ denote the length and width of the box.
Deviation regression introduces a redundant target to our approach and, at the same time, forces the network to learn the complicated transformation (Eq. 6). When the network does not have enough representation power to learn this transformation, it gets overwhelmed by this redundant information and the performance drops. On the contrary, coarse hotspot-object spatial relations (front&back, left&right, quadrants and 8 directions) can be considered relaxations of Eq. 6: they provide the inherent hotspot-object spatial relation without requiring the exact transformation. We find that quadrant classification works best in our proposed method, and therefore adopt quadrant classification in the final experiments.
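For illustration, the transformation that deviation regression would force the network to absorb can be sketched as follows. The sign convention is one plausible reading of a clockwise yaw, and all names here are ours:

```python
import math

def local_to_global_deviation(dx, dy, theta, length, width):
    """Rotate the box-frame deviation (dx, dy), normalized by box
    length/width, into the global-frame center deviation (dX, dY).
    theta is the clockwise yaw around the z axis."""
    ax = length * dx  # un-normalize along the box length
    ay = width * dy   # un-normalize along the box width
    dX = ax * math.cos(theta) + ay * math.sin(theta)
    dY = -ax * math.sin(theta) + ay * math.cos(theta)
    return dX, dY
```

Quadrant classification, by contrast, only asks the network for the sign pattern of the in-box deviation, which is why it behaves as a relaxation of this full rotation-and-scaling map.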
Effect of soft argmin. We show the importance of soft argmin by removing it gradually. 'direct' in Table 5 means we directly regress the raw values of the corresponding targets. We observe consistent improvements from using soft argmin instead of regressing raw values; the gains are largest on cyclist and pedestrian.
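For context, soft argmin regression (in the spirit of [11]) replaces direct regression of a raw value with a probability-weighted expectation over discretized bins. A minimal sketch, with an illustrative bin layout that is not the paper's exact configuration:

```python
import numpy as np

def soft_argmin(scores, bin_values):
    """Differentiable 'soft' regression: softmax over per-bin network
    scores, then the expected bin value, instead of the raw value."""
    p = np.exp(scores - scores.max())  # numerically stable softmax
    p /= p.sum()
    return float(np.dot(p, bin_values))
```

With uniform scores this returns the mean of the bins; as one score dominates, the output approaches that bin's value, so the target stays differentiable while being anchored to a bounded, discretized range.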
[Table 5: 3D detection and BEV detection AP (Mod/Easy/Hard) on Car, Cyclist and Pedestrian, comparing direct regression of raw target values ('direct') against our soft argmin regression ('Ours').]
5.6 Visualization of Hotspots
Previously, we introduced the concept of hotspots and their assignment methods. Do we really learn the hotspots, and what do they look like? We trace our detected bounding boxes back to the originally fired hotspots and visualize them in Fig. 7. Here we visualize some samples with cars from the validation dataset; all the fired hotspots are marked green. Fig. 7 (a) presents the original LiDAR point clouds in BEV, and (b) shows all the point clouds from the detected cars. Interestingly, all the fired hotspots sit at the front corner of the car. This suggests that the front corner may be the most distinctive 'part' for detecting/representing a car.
6 Qualitative Results
We visualize some representative results in Fig. 8 and Fig. 9. Our baseline, SECOND [33], an anchor-based method, typically fails to detect (misses) objects when they are far away from the sensor, i.e., when objects appear with sparse LiDAR point clouds. Our approach, however, is robust under these circumstances, as shown in Fig. 8. Both the baseline and our approach suffer from false positives, as presented in Fig. 9. A future step may be to investigate ways to incorporate appearance cues from RGB images to prevent false positives, for instance, trees being recognized as cars or road signs being recognized as pedestrians.
7 Conclusion
We propose a novel representation, object-as-hotspots, and the first one-stage, anchor-free 3D object detector, HotSpotNet, for 3D object detection in autonomous driving scenarios. Our anchor-free detector outperforms all previous one-stage detectors across categories of the KITTI dataset by a large margin. Extensive experiments show that our approach is robust and effective on sparse point clouds. We propose quadrant classification to encode the inherent relation between hotspots and objects and to stabilize network training. We believe our work sheds light on rethinking 3D object representations and, at the same time, shows the potential of anchor-free designs in 3D detection algorithms.
8 Acknowledgement
We thank Ernest Cheung (Samsung), Gweltaz Lever (Samsung), Dr. Xingyu Zhang (Apple) and Chenxu Luo (Johns Hopkins University and Samsung) for useful discussions that greatly improved the manuscript.
References
 [1] (2018) YOLO3D: end-to-end real-time 3d oriented object bounding box detection from lidar point cloud. In ECCV, Cited by: §2, §2.
 [2] (2017) Multi-view 3d object detection network for autonomous driving. In CVPR, Cited by: §2, Table 1.
 [3] (2019) Fast point rcnn. In ICCV, Cited by: §2, §2, Table 1.
 [4] (2014) Unsupervised learning of dictionaries of hierarchical compositional models. In CVPR, Cited by: §1.
 [5] (2018) YOLO4D: a spatio-temporal approach for real-time multi-object detection and classification from lidar point clouds. Cited by: §2, §2.
 [6] (2014) Learning a hierarchical compositional shape vocabulary for multi-class object representation. arXiv preprint arXiv:1408.5516. Cited by: §1.
 [7] (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, Cited by: §5.
 [8] (2015) Fast rcnn. In ICCV, Cited by: §4.2.
 [9] (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, Cited by: §5.2.
 [10] (2006) Context and hierarchy in a probabilistic image model. In CVPR, Cited by: §1.
 [11] (2017) Endtoend learning of geometry and context for deep stereo regression. In ICCV, Cited by: §4.2.
 [12] (2017) Greedy structure learning of hierarchical compositional models. arXiv preprint arXiv:1701.06171. Cited by: §1.
 [13] (2018) Joint 3d proposal generation and object detection from view aggregation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §2, Table 1.
 [14] (2019) PointPillars: fast encoders for object detection from point clouds. In CVPR, Cited by: §1, §2, §2, Table 1.
 [15] (2019) Multi-task multi-sensor fusion for 3d object detection. In CVPR, Cited by: Table 1.
 [16] (2018) Deep continuous fusion for multi-sensor 3d object detection. In ECCV, Cited by: §2, Table 1.
 [17] (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §2, §4.2.
 [18] (2017) Focal loss for dense object detection. In ICCV, Cited by: §4.2.

 [19] (2019) Point-voxel cnn for efficient 3d deep learning. arXiv preprint arXiv:1907.03739. Cited by: §2.
 [20] (2017) Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101. Cited by: §5.2.
 [21] (2019) LaserNet: an efficient probabilistic 3d object detector for autonomous driving. In CVPR, Cited by: §2.
 [22] (2019) Deep hough voting for 3d object detection in point clouds. In ICCV, Cited by: §2.
 [23] (2018) Frustum pointnets for 3d object detection from RGB-D data. In CVPR, Cited by: §2, Table 1.
 [24] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Neural Information Processing Systems, Cited by: §2, §4.
 [25] (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §2.
 [26] (2015) Faster rcnn: towards real-time object detection with region proposal networks. In Neural Information Processing Systems, Cited by: §2.
 [27] (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, Cited by: §2, §2, §2, Table 1.
 [28] (2018) Complex-YOLO: an Euler-region-proposal for real-time 3d object detection on point clouds. In ECCV, Cited by: §2, §2, Table 1.
 [29] (2019) Super-convergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, pp. 1100612. Cited by: §5.2.
 [30] (2019) Voxel-FPN: multi-scale voxel feature aggregation in 3d object detection from point clouds. arXiv preprint arXiv:1907.05286. Cited by: Table 1.
 [31] (2018) Sgpn: similarity group proposal network for 3d point cloud instance segmentation. In CVPR, Cited by: §2.
 [32] (2019) Frustum convnet: sliding frustums to aggregate local pointwise features for amodal 3d object detection. In IROS, Cited by: §2.
 [33] (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §1, §2, §2, §5.2, §5.2, §5.4, §5.5, Table 1, Table 2, Table 3, Figure 8, Figure 9, §6.
 [34] (2018) Pixor: real-time 3d object detection from point clouds. In CVPR, Cited by: §2, §2, §5.3, Table 1.
 [35] (2019) Learning object bounding boxes for 3d instance segmentation on point clouds. arXiv preprint arXiv:1906.01140. Cited by: §2.
 [36] (2019) STD: sparse-to-dense 3d object detector for point cloud. arXiv preprint arXiv:1907.10471. Cited by: §2, §2.
 [37] (2018) DeepVoting: a robust and explainable deep network for semantic part detection under partial occlusion. In CVPR, Cited by: §1.
 [38] (2019) IoU loss for 2d/3d object detection. arXiv preprint arXiv:1908.03851. Cited by: Table 1.
 [39] (2018) VoxelNet: end-to-end learning for point cloud based 3d object detection. In CVPR, Cited by: §1, §2, §2, §5.2, Table 1.
 [40] (2008) Unsupervised structure learning: hierarchical recursive composition, suspicious coincidence and competitive exclusion. In ECCV, Cited by: §1.