The code is implemented to show focal loss improvement based on 3D-FCN and VoxelNet for “Focal Loss in 3D Object Detection”.
3D object detection is still an open problem in autonomous driving scenes. Robots recognize and localize key objects from sparse inputs, and suffer from a larger continuous search space as well as a more serious fore-background imbalance than in image-based detection. In this paper, we try to solve the fore-background imbalance in the 3D object detection task. Inspired by the recent improvement that focal loss brought to image-based detection, where it can be seen as a hard-mining improvement of binary cross entropy, we extend it to point-cloud-based object detection and conduct experiments to show its performance on two different types of 3D detectors: 3D-FCN and VoxelNet. The results show up to 11.2 AP gains from focal loss in a wide range of hyperparameters in 3D object detection. Our code is available at <https://github.com/pyun-ram/FL3D>.
3D object detection is an interesting problem in robotic perception, with applications that include urban and suburban roads, highways, bridges and indoor settings. Robots recognize and localize key objects from data in 3D form and predict their locations, sizes and orientations, which provides both semantic and spatial information for high-level decision making. Point clouds are one of the main 3D data forms, and can be gathered by range sensors such as LiDAR and RGB-D cameras. Since the coordinate information of point clouds is not influenced by appearance changes, the point cloud representation is robust even in extreme weather and across seasons. In addition, it is naturally scale-invariant, i.e. the scale of an object is the same anywhere in a point cloud, while it always changes in an image due to foreshortening effects. Besides, the increasing range and decreasing price of 3D LiDAR provide a promising direction for autonomous driving researchers.
Current image-based detectors benefit from the translation invariance of convolution operations and achieve human-like accuracy. However, the successful image-based architectures cannot be directly applied in 3D space. Point-cloud-based object detection consumes point clouds, which are sparse point lists instead of dense arrays. To draw on the success of image-based detectors and use dense convolution to acquire translation invariance, pre-processing must be implemented to convert the sparse point clouds into dense arrays. Otherwise, special layers must be carefully designed to extract meaningful features from the sparse inputs. On the other hand, the foreground-background imbalance is much more serious than in 2D scenarios, since the new z-axis further enlarges the search space while the number of positive objects stays at the same order of magnitude as in image-based object detection.
Lin et al. proposed focal loss to solve the fore-background imbalance in image-based detectors, so that one-stage detectors can reach the same state-of-the-art accuracy as two-stage detectors in image-based detection. It can be seen as a hard-mining improvement of binary cross entropy that helps the network focus on hard-to-classify objects, so that they are not overwhelmed by a large number of easily classified objects.
Similar to image-based detection methods, point-cloud-based detection methods can be classified into two-stage [2, 3, 4] and one-stage detectors [5, 6]. In this paper, inspired by Lin et al., we try to solve the fore-background imbalance in 3D object detection. We claim the following contributions:
We extend focal loss to 3D object detection to solve the huge fore-background imbalance in one-stage detectors, and conduct experiments on two different one-stage 3D object detectors - 3D-FCN and VoxelNet (Table. I). The experiment results demonstrate up to 11.2 AP gains from focal loss in a wide range of hyperparameters.
To further understand focal loss in 3D object detection, we analyze its effect on foreground and background estimations in both 3D-FCN and VoxelNet. We validate that it plays a role similar to the one it plays in image-based detection, and find that the special architecture of VoxelNet naturally handles hard negatives well.
We plot the final confidence distributions of the two detectors and demonstrate that increasing the focal loss hyperparameter decreases the estimation confidence.
When extending two-stage image detectors to 3D, the following problems appear: the input is sparse and at low resolution, and the original method is not guaranteed to have enough information to generate region proposals, especially for small object classes. Ku et al. designed AVOD, which fuses RGB images and point clouds. It first proposes aligned 3D bounding boxes with a multimodal fusion region proposal network and then classifies and regresses the proposed bounding boxes with fully connected layers. Both appearance and 3D information are well utilized to improve the accuracy and robustness in extreme scenes. Their hand-crafted features leave room for improvement, since they are sub-optimal for minimizing the loss function.
Qi et al. leveraged both 2D object detectors and 3D deep learning for object localization. They extracted the 3D bounding frustum of an object with a 2D object detector. Then 3D instance segmentation and 3D bounding box regression were applied with two variants of PointNet. F-PointNet achieves state-of-the-art accuracy on the KITTI challenge, while also running at real-time speed in 3D detection tasks. Their image detector must be well designed, with a high recall rate, since the upper bound of the accuracy is determined by the first stage.
Li et al. extended the 2D fully convolutional network to 3D (3D-FCN). The voxelized point clouds are processed by an encoder-decoder network. The 3D fully convolutional network finally proposes a probability map and a regression map for the whole detection region. It consists entirely of 3D dense convolutions with high computation and memory costs, so the network is shallow and abstract features can hardly be extracted. Unlike AVOD, which adopts hand-crafted features to represent the point clouds, Zhou et al. designed an end-to-end network that implements point-cloud-based 3D object detection with learned representations (VoxelNet). Compared to 3D-FCN, the computation cost is mitigated by the Voxel Feature Encoding Layers (VFELayers) and 2D convolution.
In this paper, we adopt 3D-FCN and VoxelNet as two different types of one-stage 3D detectors. As shown in Table. I, 3D-FCN consumes dense grids and consists of only 3D dense convolution layers, where the 2D FCN architecture is extended to 3D for dense feature extraction. In contrast, VoxelNet consumes sparse point lists and is a heterogeneous network, which first extracts sparse features with its novel VFELayers and then conducts 3D and 2D convolution sequentially.
|Input||Dense Grid||Dense Grid||Sparse Point List|
|Network||Dense Conv||Dense Conv||Heterogeneous|
Image-based object detectors can be classified into two-stage and one-stage detectors. For two-stage detectors, like R-CNN, the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as the background using a CNN. The two-stage detectors [9, 10] achieve state-of-the-art accuracy on the COCO benchmark. The one-stage detectors, on the other hand, aim to simplify the pipeline, like YOLO and SSD. They improve the speed of the network and also demonstrate promising results in terms of accuracy.
Lin et al. explored both one-stage and two-stage detectors in image-based object detection, and claimed that the hurdle that keeps one-stage detectors from better accuracy is the extreme fore-background class imbalance encountered during the training of dense detectors. They reshaped the standard cross entropy loss and proposed focal loss such that the losses assigned to well-classified examples are down-weighted. It can be seen as a hard-mining improvement of binary cross entropy that helps the network focus on hard-to-classify objects, so that they are not overwhelmed by a large number of easily classified objects.
We extend focal loss to the task of 3D object detection to solve the problem of fore-background imbalance. Different from image-based detection, point-cloud-based object detection is a more challenging perception problem in 3D space with sparse sensor data, and it suffers from a more serious fore-background imbalance. To thoroughly evaluate the performance of focal loss in this harder task, we conduct experiments based on two different types of one-stage 3D detectors: 3D-FCN and VoxelNet. We analyze the effect of focal loss on these two 3D detectors following a method similar to that of Lin et al., and further discuss the confidence-decreasing effect of focal loss.
In this section, we first declare notations and revisit focal loss, and then analyze the fore-background imbalance in 3D object detection.
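We define $y \in \{\pm 1\}$ as the ground-truth class, and $p \in [0, 1]$ as the estimated probability for the class with label $y = 1$. For notational convenience, we define $p_t$:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \tag{1}$$
The binary cross entropy loss (BCE loss) and its derivative w.r.t. $x$ (where $p$ is calculated from $p = \mathrm{sigmoid}(x)$) can be formulated as
$$L_{BCE}(p_t) = -\log(p_t) \tag{2}$$
$$\frac{\partial L_{BCE}}{\partial x} = y\,(p_t - 1) \tag{3}$$
As claimed by Lin et al., when the network is trained with BCE loss, its gradient will be dominated by the vast number of easy negative samples if a huge fore-background imbalance exists. Focal loss can be considered as a dynamically scaled cross entropy loss, which is defined as
$$L_{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t) \tag{4}$$
$$\frac{\partial L_{FL}}{\partial x} = y\,(1 - p_t)^{\gamma}\,\bigl(\gamma\, p_t \log(p_t) + p_t - 1\bigr) \tag{5}$$
The loss contribution of well-classified samples ($p_t$ close to 1) will be down-weighted. The hyperparameter $\gamma$ of focal loss can be used to tune the effect of focal loss. As $\gamma$ increases, fewer easily classified samples contribute to the loss. When $\gamma$ reaches 0, focal loss degrades into BCE loss (Figure. 1). In the following sections, all the cases with $\gamma = 0$ represent BCE loss cases.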
Researchers previously either introduced hyperparameters to balance the losses calculated from positive and negative anchors, or normalized positive and negative losses by the frequency of the corresponding anchors, such that all sub-losses are balanced to the same order of magnitude. However, one important thing these two methods cannot handle is the gradient salience of hard negative samples: the gradients of hard negative anchors (low $p_t$) are overwhelmed by a large number of easy negative anchors (high $p_t$). Due to its dynamic scaling with the confidence $p_t$, a weighted focal loss can be used to handle both the fore-background imbalance and the gradient salience of hard negative samples, with the following form.
$$L_{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma} \log(p_t) \tag{6}$$
where $\alpha_t$ is introduced to weight different classes. In the following sections, we adopt the weighted focal loss form, with separate values of the hyperparameter $\alpha_t$ for the positive focal loss and for the negative focal loss.
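The behaviour of the weighted focal loss can be checked numerically. The following is a minimal NumPy sketch, not the released implementation: the function name and the weights `alpha_pos` and `alpha_neg` are illustrative assumptions, and the default values are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def weighted_focal_loss(p, y, gamma=2.0, alpha_pos=1.0, alpha_neg=1.0, eps=1e-7):
    """Per-sample weighted focal loss for labels y in {+1, -1}.

    p is the estimated probability of the positive class.
    alpha_pos / alpha_neg are hypothetical class weights (assumed values).
    """
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha_pos, alpha_neg)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.6, 0.1, 0.4])
y = np.array([1, 1, -1, -1])
bce = weighted_focal_loss(p, y, gamma=0.0)   # gamma = 0 reduces to BCE
fl = weighted_focal_loss(p, y, gamma=2.0)    # easy samples are down-weighted
```

With these inputs the easy samples ($p_t = 0.9$) keep only a factor $(1 - 0.9)^2 = 0.01$ of their BCE loss, while the harder ones ($p_t = 0.6$) keep a factor of $0.16$.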
The methods for 3D object detection can be classified as two-stage [2, 3, 4] and one-stage detectors [5, 6]. The two-stage detectors first adopt an algorithm with a high recall rate to propose regions that possibly contain objects and adopt a convolution network to classify classes and regress bounding boxes. The one-stage detectors are end-to-end networks that learn representations and implement classification and regression in all anchors.
In one-stage methods, anchors are proposed at each location, thus a huge fore-background imbalance exists. For instance, there are 50k bounding boxes proposed in each frame for 3D-FCN and 70k for VoxelNet, but fewer than 30 anchors among them contain positive objects (i.e. car, pedestrian, cyclist). In contrast, the first-stage proposal helps alleviate the fore-background imbalance in two-stage methods, since it only proposes hundreds of bounding boxes with a high recall rate. One-stage 3D detectors differ from their 2D counterparts because of their larger search space, different types of network architectures and sparse input. Therefore, we select two different networks, 3D-FCN and VoxelNet, to conduct experiments that evaluate focal loss performance in 3D object detection. The features of these two 3D detectors are discussed in the following two sections, and the experiment details and results are shown in Sec. VI.
In this section, we describe the dense convolution network architecture of 3D-FCN and introduce our enhanced loss function for 3D-FCN. For the details of 3D-FCN, we refer to Li et al.; our implementation of 3D-FCN can be found in the APPENDIX.
3D-FCN draws on the experience of image-based recognition tasks, and extends the 2D convolution layer to 3D space to acquire translation invariance. The input point cloud is first voxelized into a 3D dense grid. In each voxel of the 3D dense grid, the value indicates whether any point is observed. The network architecture of 3D-FCN is shown in Figure. 2. The voxelized point cloud is convolved by four Conv3D blocks sequentially. The output features are then processed separately by two Conv3D layers to generate a probability map and a regression map (P-Map and R-Map). Different from image-based object detection, the probability map and regression map are both 3D dense grids, so the search space is much larger.
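The voxelization step described above can be sketched as follows. This is an illustrative NumPy sketch under assumed parameters: the function name, the grid origin, the voxel size and the grid shape are hypothetical and are not taken from the paper or the released code.

```python
import numpy as np

def voxelize_occupancy(points, grid_min, voxel_size, grid_shape):
    """Convert an (N, 3+) point array into a dense binary occupancy grid.

    A voxel is set to 1 if at least one point falls inside it, 0 otherwise.
    grid_min, voxel_size and grid_shape are assumed, illustrative parameters.
    """
    points = np.asarray(points, dtype=np.float64)
    idx = np.floor((points[:, :3] - np.asarray(grid_min)) / np.asarray(voxel_size))
    idx = idx.astype(np.int64)
    # Discard points that fall outside the detection region.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[inside]
    grid = np.zeros(grid_shape, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

pts = np.array([[0.10, 0.10, 0.1],
                [0.15, 0.12, 0.1],   # same voxel as the first point
                [1.90, 0.50, 0.3],
                [5.00, 5.00, 5.0]])  # outside the grid
grid = voxelize_occupancy(pts, grid_min=(0.0, 0.0, 0.0),
                          voxel_size=(0.5, 0.5, 0.5), grid_shape=(4, 4, 2))
```

Two points sharing a voxel produce a single occupied cell, which is the information loss that a binary occupancy representation accepts.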
The original loss function for 3D-FCN is a simple version in which only the classification and regression losses are balanced. We adopt a loss that normalizes each sub-loss by its corresponding frequency and introduces three hyperparameters to balance them, so that the positive and negative classification losses, as well as the classification and regression losses, are in the same order of magnitude. The loss combines a classification loss and a regression loss, normalized by the numbers of positive and negative voxels. The regression loss is the square loss between the regression output and the ground truth for positive anchors. The classification loss is either the binary cross entropy (Eq. 2) or the focal loss (Eq. 6), applied to the confidences of the positive and negative estimations respectively.
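One way to read this description is sketched below. It is an assumption-laden illustration, not the authors' formula: the names `alpha`, `beta` and `eta` for the three balancing hyperparameters, the placement of each weight, and the normalization of the regression term by the number of positive voxels are all guesses made for the sake of a concrete example.

```python
import numpy as np

def focal_term(p_t, gamma):
    """Focal loss on the probability of the true class (gamma = 0 gives BCE)."""
    p_t = np.clip(p_t, 1e-7, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def detection_loss(p_pos, p_neg, reg_out, reg_gt,
                   alpha=1.0, beta=1.0, eta=1.0, gamma=0.0):
    """Hypothetical combined loss with frequency-normalized sub-losses.

    p_pos / p_neg: predicted object probabilities at positive / negative voxels.
    reg_out / reg_gt: regression output and ground truth for positive anchors.
    alpha, beta, eta: assumed names for the three balancing hyperparameters.
    """
    p_pos = np.asarray(p_pos, dtype=np.float64)
    p_neg = np.asarray(p_neg, dtype=np.float64)
    n_pos = max(len(p_pos), 1)
    n_neg = max(len(p_neg), 1)
    cls_pos = focal_term(p_pos, gamma).sum() / n_pos        # true class: object
    cls_neg = focal_term(1.0 - p_neg, gamma).sum() / n_neg  # true class: background
    diff = np.asarray(reg_out, dtype=np.float64) - np.asarray(reg_gt, dtype=np.float64)
    reg = np.sum(diff ** 2) / n_pos                         # square loss
    return alpha * cls_pos + beta * cls_neg + eta * reg

total = detection_loss(p_pos=[0.5], p_neg=[0.5, 0.5],
                       reg_out=[[1.0, 2.0]], reg_gt=[[1.0, 1.0]])
```

Normalizing each classification term by its own count keeps the tens of thousands of negative voxels from dominating the handful of positive ones before any hyperparameter tuning.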
In this section, we describe the heterogeneous network architecture of VoxelNet and its bird's-eye-view estimation. For the details of VoxelNet, we refer to Zhou et al.; our implementation of VoxelNet can be found in the APPENDIX.
The heterogeneous architecture overview of VoxelNet is shown in Figure. 3. It consists of three main parts: FeatureNet (point-wise and voxel-wise feature transformation), MiddleLayer (3D dense convolution) and RPN (2D dense convolution).
FeatureNet extracts features directly from sparse point lists. It adopts Voxel Feature Encoding Layers (VFELayers) to extract both point-wise and voxel-wise features directly from points, where fully connected layers are used to extract point-wise features and a symmetric function is used to aggregate local features from all points within a local voxel. Compared to sub-optimally deriving hand-crafted features from voxels (e.g. a binary value representing non-empty voxels), VFELayers are able to learn representations that are optimized to minimize the loss function.
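The structure of a single VFELayer can be illustrated with a small NumPy sketch for one voxel. It assumes the design of the original VoxelNet, where the symmetric function is an element-wise max and the aggregated feature is concatenated back to every point-wise feature; batch normalization is omitted, and the function name, weights and sizes are hypothetical.

```python
import numpy as np

def vfe_layer(voxel_points, weight, bias):
    """Simplified VFE layer for the points of a single voxel.

    voxel_points: (T, C_in) features of the T points inside the voxel.
    weight, bias: parameters of the fully connected layer (illustrative).
    Returns (T, 2 * C_out): point-wise features concatenated with the
    voxel-wise aggregated feature.
    """
    pointwise = np.maximum(voxel_points @ weight + bias, 0.0)  # FC + ReLU
    aggregated = pointwise.max(axis=0, keepdims=True)          # symmetric function
    tiled = np.repeat(aggregated, voxel_points.shape[0], axis=0)
    return np.concatenate([pointwise, tiled], axis=1)

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 7))      # 5 points with 7 input features each
w = rng.normal(size=(7, 16))
b = np.zeros(16)
out = vfe_layer(pts, w, b)

perm = rng.permutation(5)
out_perm = vfe_layer(pts[perm], w, b)
```

Because the max is symmetric, the voxel-wise half of the output does not depend on the order in which the points of the voxel are listed.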
The voxel-wise representations derived from VFELayers are sparse. The sparse representation saves memory and computation costs. For instance, if a point cloud of the KITTI dataset is partitioned into a dense grid for vehicle detection, only around 5300 voxels (about 0.3% of the voxels in this dense grid) are non-empty. However, the sparse representation is currently unfriendly to convolutional operations. In order to implement convolution, VoxelNet compromises some efficiency and converts the sparse representation to a dense representation. Each sparse voxel-wise representation is copied to its specific entry in the dense grid.
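The sparse-to-dense conversion amounts to a scatter operation. The sketch below is illustrative only: the function name, the toy grid size and the feature width are assumptions chosen to keep the example small.

```python
import numpy as np

def scatter_to_dense(voxel_features, voxel_coords, grid_shape):
    """Copy each sparse voxel-wise feature to its entry in a dense grid.

    voxel_features: (K, C) features of the K non-empty voxels.
    voxel_coords:   (K, 3) integer grid indices of those voxels.
    Empty voxels keep an all-zero feature vector.
    """
    channels = voxel_features.shape[1]
    dense = np.zeros(tuple(grid_shape) + (channels,), dtype=voxel_features.dtype)
    dense[voxel_coords[:, 0], voxel_coords[:, 1], voxel_coords[:, 2]] = voxel_features
    return dense

features = np.array([[1.0] * 4, [2.0] * 4, [3.0] * 4])
coords = np.array([[0, 0, 0], [1, 2, 3], [3, 3, 1]])
dense = scatter_to_dense(features, coords, grid_shape=(4, 4, 4))
occupancy = np.any(dense != 0, axis=-1).mean()   # fraction of non-empty voxels
```

The dense grid stores a feature vector for every voxel, so at an occupancy of about 0.3% almost all of the memory and convolution work is spent on empty entries.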
MiddleLayer consumes the 3D dense grid and converts it to a 2D bird's-eye-view form, so that further processing can be done in 2D space. The role of MiddleLayer is to learn features from all voxels in the same bird's-eye-view location. Therefore, its 3D convolutional kernels aggregate voxel-wise features within a progressively expanding receptive field along the z-axis, while keeping the shape in the remaining dimensions.
RPN predicts the probability and regression maps from the 2D bird's-eye-view feature map. It does not use max-pooling and adopts skip-layers to combine high-level semantic features and low-level spatial features. Our interpretation of this design is that, even though strides and max-pooling provide deep convolutional neural networks with spatial invariance, the increased invariance and large receptive fields of top-level nodes yield smooth responses that cause inaccurate localization.
The final probability and regression estimation maps are both in bird's-eye-view form, which is similar to the final estimation of image-based detection methods. This saves both memory and computation compared to 3D maps, but it can only estimate one object per location in bird's eye view. This is acceptable in autonomous driving scenes but will cause problems in indoor scenes, where objects can be stacked (e.g. a mug on a book).
MiddleLayer saves computation for further processing by aggregating the 3D dense grid into a 2D bird's-eye-view feature map. Otherwise, using 3D dense convolution throughout such a deep network (22 convolution layers) would require far more parameters and computation. We note that MiddleLayer is currently still a bottleneck of the whole network, as the GFLOPs in Table. VI show, because of its 3D dense convolution operation. Efficient sparse convolutional implementation is still an open problem and deserves effort.
|Bird’s Eye View AP (%)||3D Detection AP (%)|
|Bird’s Eye View AP (%)||3D Detection AP (%)|
In this section, we intend to answer two questions: 1) Can focal loss help improve accuracy in the 3D object detection task? 2) Does focal loss have the same effect in 3D object detection as in 2D? To answer 1), we conduct experiments to compare the performance of 3D-FCN and VoxelNet trained with BCE loss and focal loss on the challenging KITTI benchmark. To answer 2), we analyze the cumulative distribution curves of 3D-FCN and VoxelNet following a method similar to that of Lin et al.
The KITTI 3D object detection dataset contains 3D annotations for cars, pedestrians and cyclists in urban driving scenarios. The sensor setup mainly consists of a wide-angle camera and a Velodyne LiDAR (HDL-64E), both of which are well calibrated. The training dataset contains 7481 frames with raw sensor data and annotations. Following prior work, we split the dataset into training and validation sets, each containing around half of the entire set. For simplicity, we conduct experiments only on the car class, since both 3D-FCN and VoxelNet are trained class-specifically, and extending them to other classes is straightforward apart from tuning. Besides, the focal loss is agnostic to the object class in terms of Eq. 6.
The network details of both 3D-FCN and VoxelNet are shown in Table. V and VI in the APPENDIX. We tune the three balancing hyperparameters of the loss so that the positive and negative classification losses, as well as the classification and regression losses, are in the same order of magnitude. The values are set separately for 3D-FCN and for VoxelNet. The KITTI 3D detection dataset contains some noisy annotations, namely empty bounding boxes that contain no points. In order to avoid overfitting to these annotations, we remove all bounding boxes containing few points (less than 10 points).
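The annotation filtering can be sketched as follows. This is a simplified illustration rather than the released pre-processing code: it assumes that points and boxes are expressed in the same LiDAR frame, that each box is given as a hypothetical tuple `(xc, yc, zc, l, w, h, yaw)` with `(xc, yc, zc)` at the geometric center, and that the yaw is a rotation around the z-axis.

```python
import numpy as np

def count_points_in_box(points, box):
    """Count the points inside a yaw-rotated 3D box (assumed box format)."""
    xc, yc, zc, l, w, h, yaw = box
    shifted = points[:, :3] - np.array([xc, yc, zc])
    # Rotate the points by -yaw to express them in the box frame.
    c, s = np.cos(-yaw), np.sin(-yaw)
    x_local = c * shifted[:, 0] - s * shifted[:, 1]
    y_local = s * shifted[:, 0] + c * shifted[:, 1]
    inside = ((np.abs(x_local) <= l / 2.0)
              & (np.abs(y_local) <= w / 2.0)
              & (np.abs(shifted[:, 2]) <= h / 2.0))
    return int(inside.sum())

def filter_sparse_boxes(points, boxes, min_points=10):
    """Keep only the boxes that contain at least min_points points."""
    return [b for b in boxes if count_points_in_box(points, b) >= min_points]

rng = np.random.default_rng(0)
cloud = rng.uniform(-0.5, 0.5, size=(12, 3))      # 12 points near the origin
box_a = (0.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.0)       # contains all 12 points
box_b = (20.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.0)      # empty annotation
kept = filter_sparse_boxes(cloud, [box_a, box_b], min_points=10)
```

Note that KITTI labels are stored in the camera frame with the box origin at the bottom face, so a real pipeline would first convert them to the convention assumed here.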
As claimed by Lin et al., training a network with focal loss from scratch is unstable at the beginning. Therefore, in order to stabilize training, we first train the network (both 3D-FCN and VoxelNet) for 30 epochs with BCE loss, and then continue training it with focal loss for another 30 epochs with a specific $\gamma$ and learning rate. We compare the results of the last epoch in Table. II and III, where the rows with $\gamma = 0$ represent BCE loss results and the rows with $\gamma > 0$ represent focal loss results, while bolded numbers mark the results in which a focal loss case outperforms the BCE loss case.
In general, VoxelNet outperforms 3D-FCN in accuracy, since VoxelNet takes the original point clouds as input while 3D-FCN voxelizes the point clouds with information loss. Besides, VoxelNet is a deeper network, and experience in image-based recognition tasks shows that deeper networks are able to extract more useful high-level features. In 3D-FCN, focal loss improves accuracy in all metrics over a wide range of $\gamma$. Focal loss provides gains from 0.5 AP to 11.2 AP in these cases. In VoxelNet, part of the tested $\gamma$ range shows gains from focal loss in all metrics. The gains range from 0.6 AP to 9.1 AP. For the remaining values of $\gamma$, both gains and degradations occur. However, the best result is obtained with focal loss, and the gains generally outweigh the degradations: the degradations are up to 2.7 AP, while the gains are up to 9.1 AP. Therefore, in 3D object detection, focal loss can help improve accuracy over a wide range of $\gamma$, and this range differs from network to network.
From this table, we reach the same conclusion: focal loss helps improve accuracy in 3D object detection. Note that every case in this table is the best result among all intermediate weights, thus the accuracy improvement comes from focal loss rather than from longer training.
We analyze the empirical cumulative distributions of the loss of the converged 3D-FCN and VoxelNet, as done by Lin et al. We evaluate all intermediate weights and select the best models, which are detailed in Table. IV. We apply the two converged models trained with focal loss (row 2 and row 4 in Table. IV) to the validation dataset and sample the predicted probabilities for negative windows and positive windows. Then, we calculate focal loss with these probability data. The calculated focal loss is normalized such that it sums to one and sorted from low to high. We plot the cumulative distributions for 3D-FCN and VoxelNet for different $\gamma$.
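The procedure above (compute the focal loss, normalize it to sum to one, sort it from low to high, accumulate) can be sketched in a few lines. The sketch uses synthetic probabilities instead of real network outputs, and the function and variable names are illustrative.

```python
import numpy as np

def normalized_loss_cdf(p_t, gamma):
    """Cumulative distribution of the normalized focal loss.

    p_t: probabilities assigned to the true class, for positive windows only
         or for negative windows only.
    Returns the fraction of samples (x-axis) and the cumulative normalized
    loss (y-axis), with samples sorted from lowest to highest loss.
    """
    p_t = np.clip(np.asarray(p_t, dtype=np.float64), 1e-7, 1.0)
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    loss = np.sort(loss)                 # sort from low to high
    loss = loss / loss.sum()             # normalize so that it sums to one
    fraction = np.arange(1, len(loss) + 1) / len(loss)
    return fraction, np.cumsum(loss)

rng = np.random.default_rng(0)
p_t = rng.uniform(0.01, 0.99, size=1000)   # synthetic stand-in for sampled outputs

_, cdf_bce = normalized_loss_cdf(p_t, gamma=0.0)
_, cdf_fl = normalized_loss_cdf(p_t, gamma=2.0)

k = int(0.85 * len(p_t))                   # index separating the hardest 15%
top15_bce = 1.0 - cdf_bce[k - 1]           # loss share of the hardest 15%
top15_fl = 1.0 - cdf_fl[k - 1]
```

Reading the curve at 85% of the samples gives the share of the loss carried by the hardest 15%, which is the quantity discussed for the positive samples of 3D-FCN.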
The cumulative distributions for different $\gamma$ of 3D-FCN and VoxelNet are shown in Figure. 4. In 3D-FCN, approximately 15% of the hardest positive samples account for roughly half of the positive loss. As $\gamma$ increases, more of the loss gets concentrated in the top 15% of examples. However, compared to the effect of focal loss on negative samples, its effect on positive samples is minor. At low $\gamma$, the positive and negative CDFs are quite similar. As $\gamma$ increases, substantially more weight becomes concentrated on the hard negative examples. With the $\gamma$ that gives the best result for 3D-FCN, the vast majority of the loss comes from a small fraction of samples. As claimed by Lin et al., focal loss can effectively discount the effect of easy negatives, so that the network focuses on learning the hard negative examples.
In VoxelNet, the situation is different. From the bottom row of Figure. 4, we can see that the effect of focal loss increases for both positive and negative samples as $\gamma$ increases. However, the cumulative distribution functions for negative samples are quite similar among different values of $\gamma$, even after we adjust the range of the x-axis. This shows that VoxelNet trained with binary cross entropy is already able to handle hard negative samples. Compared with the effect on negative samples, the effect of focal loss on positive samples is stronger. Therefore, the accuracy gains of focal loss in VoxelNet come mainly from hard positive samples.
From the analysis of the cumulative distributions, we believe that in 3D object detection, focal loss helps the network cope with the fore-background imbalance and alleviates the gradient salience of hard samples in the training process.
During the experiments, we found that a network trained with focal loss should be given a lower threshold for non-maximum suppression. This inspired us to explore the effect of focal loss on output confidence. We take the models in Table. II and III and evaluate them on the validation set. We record all the evaluation results and plot the histogram of positive bounding box probabilities. The results are shown in Figure. 5. As $\gamma$ increases, the peak decreases and moves towards the left. This demonstrates that a network trained with focal loss outputs positive estimations with less confidence. It can be understood as follows: the objects with high confidence are easily classified objects, and the loss they contribute is down-weighted by focal loss in the training process. In other words, they are relatively ignored in the training process once they are estimated with high confidence, so their confidence cannot be further improved. They can still be accurately classified if we decrease the non-maximum suppression threshold in the final output step.
In this paper, we extended the focal loss of image detectors to 3D object detection to solve the foreground-background imbalance. We adopted two different types of 3D object detectors to demonstrate the performance of focal loss in point-cloud-based object detection. The experiment results show that focal loss helps improve accuracy in 3D object detection, and that it protects the network from hard sample gradient salience for both positive and negative anchors in the training process. The confidence histograms of models trained with focal loss show that these models output positive estimations with less confidence.
In this appendix, we describe our implementation details about 3D-FCN and VoxelNet.
The network details of 3D-FCN are shown in Table. V. Each Conv3D block in the BodyNet applies 3D convolution, ReLU and batch normalization sequentially. In the HeadNet, each Conv3D block applies only 3D convolution. The sigmoid function is applied to get the posterior probability. Finally, non-maximum suppression is applied to output the most confident one among all overlapping bounding boxes.
In the training phase, we create the ground truth for the P-Map by setting the object voxel that contains the object center to 1, and all other non-object voxels to 0. For the regression map, we create the ground truth by setting the object voxels to a 24-length residual vector, and all other non-object voxels to 24-length zeros. The 24-length residual vectors are the coordinates of the 8 corner points of the bounding box in a fixed order.
Our implementation of the 3D-FCN baseline is shown in row 1 of Table. IV. It is not as good as originally reported, for the following reasons: 1) we eliminated all empty ground-truth bounding boxes which contain few points (less than 10), thus some positive samples cannot be recognized; 2) we trained the network for only 60 epochs; 3) we simplified the network architecture to reduce memory costs by removing the deconvolution layers. Even though these are not state-of-the-art results, the experiment results demonstrate that focal loss helps improve accuracy in 3D object detection.
The network details of VoxelNet are shown in Table. VI. The FC block in VoxelNet consists of a linear fully connected layer, a batch normalization layer and a non-linear (ReLU) layer sequentially. Each Conv3D block in the MiddleLayer applies 3D convolution, ReLU and batch normalization, while each Conv2D block in the RPN applies 2D convolution, ReLU and batch normalization sequentially. P-Map and R-Map consist of only 2D convolution layers without ReLU or batch normalization. The sigmoid function is applied to get the posterior probability. Non-maximum suppression is also applied at the end.
We adopt the original parameterization method of VoxelNet. A 3D bounding box is parameterized as $(x_c, y_c, z_c, l, w, h, \theta)$, where $x_c, y_c, z_c$ represent the center location, $l, w, h$ are the length, width, and height of the box, and $\theta$ is the yaw rotation around the Z-axis.
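For regression, the residual vector between the ground truth and an anchor is denoted as $(\Delta x, \Delta y, \Delta z, \Delta l, \Delta w, \Delta h, \Delta \theta)$:
$$\Delta x = \frac{x_c^g - x_c^a}{d^a}, \quad \Delta y = \frac{y_c^g - y_c^a}{d^a}, \quad \Delta z = \frac{z_c^g - z_c^a}{h^a},$$
$$\Delta l = \log\frac{l^g}{l^a}, \quad \Delta w = \log\frac{w^g}{w^a}, \quad \Delta h = \log\frac{h^g}{h^a}, \quad \Delta \theta = \theta^g - \theta^a,$$
where $d^a = \sqrt{(l^a)^2 + (w^a)^2}$ is the diagonal of the base of the anchor box, and superscript $g$ denotes the ground truth while $a$ denotes the anchor box. The loss function for VoxelNet is similar to that of 3D-FCN, but with SmoothL1 loss for the regression loss.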
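The residual encoding can be sketched as a small function. It follows the parameterization of the original VoxelNet paper; the function name and the example anchor and ground-truth values are illustrative and are not taken from the released code.

```python
import numpy as np

def encode_box_residual(gt, anchor):
    """Residual vector between a ground-truth box and an anchor box.

    Both boxes are (x_c, y_c, z_c, l, w, h, theta). Center offsets in x and y
    are normalized by the diagonal of the anchor base, the z offset by the
    anchor height, sizes are encoded as log ratios and the yaw as a difference.
    """
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = np.sqrt(la ** 2 + wa ** 2)   # diagonal of the anchor base
    return np.array([
        (xg - xa) / da,
        (yg - ya) / da,
        (zg - za) / ha,
        np.log(lg / la),
        np.log(wg / wa),
        np.log(hg / ha),
        tg - ta,
    ])

anchor = (0.0, 0.0, -1.0, 3.0, 4.0, 1.5, 0.0)          # base diagonal is 5
gt = (5.0, -5.0, 0.5, 3.0 * np.e, 4.0, 1.5, 0.3)
residual = encode_box_residual(gt, anchor)
```

For this example the residual is approximately `[1, -1, 1, 1, 0, 0, 0.3]`, and an anchor that coincides with the ground truth gives an all-zero vector.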
Our implementation results are not as good as originally reported, for the following reasons: 1) we used a small batch size of 1. Zhou et al. mentioned that their batch size was set to 16 in VoxelNet, but a normal GPU with 12 GB memory can only support a batch size of 2. 2) We trained the network for only 60 epochs, while they trained VoxelNet for 120 epochs. Even though these are not state-of-the-art results, the experiment results demonstrate that focal loss helps improve accuracy in 3D object detection. Our released code and weights can help researchers easily reproduce our results.
|Block Name||Layer Name||Kernel Size||Strides||Filter||GFLOPs|
|Block Name||Layer Name||