FL3D
This code demonstrates the focal loss improvement based on 3DFCN and VoxelNet for "Focal Loss in 3D Object Detection".
3D object detection is still an open problem in autonomous driving scenes. Robots recognize and localize key objects from sparse inputs, and suffer from a larger continuous searching space as well as a more serious fore-background imbalance compared to image-based detection. In this paper, we try to solve the fore-background imbalance in the 3D object detection task. Inspired by the recent improvement of focal loss on image-based detection, which can be seen as a hard-mining improvement of binary cross entropy, we extend it to point-cloud-based object detection and conduct experiments to show its performance based on two different types of 3D detectors: 3DFCN and VoxelNet. The results show up to 11.2 AP gains from focal loss in a wide range of hyperparameters in 3D object detection. Our code is available at <https://github.com/pyunram/FL3D>.
3D object detection is an interesting problem in robotic perception, whose application scenes widely include urban and suburban roads, highways, bridges and indoor settings. Robots recognize and localize key objects from data in 3D form and predict their locations, sizes and orientations, which provides both semantic and spatial information for high-level decision making. Point clouds are one of the main 3D data forms and can be gathered by range cameras such as LiDAR and RGB-D cameras. Since the coordinate information of point clouds is not influenced by appearance change, the point cloud representation is robust even in extreme weather and across seasons. In addition, it is naturally scale-invariant, i.e. the scale of an object is invariant anywhere in a point cloud, while it always changes in an image due to foreshortening effects. Besides, the increasing range and decreasing price of 3D LiDAR provide a promising direction for autonomous driving researchers.
Current image-based detectors benefit from the translation invariance of convolution operations and perform with human-like accuracy. However, the successful image-based architectures cannot be directly applied in 3D space. Point-cloud-based object detection consumes point clouds, which are sparse point lists instead of dense arrays. If drawing on the success of image-based detectors and conducting dense convolution operations to acquire translation invariance, preprocessing must be implemented to convert the sparse point clouds into dense arrays. Otherwise, special layers should be carefully designed to extract meaningful features from the sparse inputs. On the other hand, the foreground-background imbalance is much more serious than in 2D scenarios, since the new z-axis further enlarges the searching space while the number of positive objects stays at the same order of magnitude as in image-based object detection.
Lin et al. [1] proposed focal loss to solve the fore-background imbalance in image-based detectors, so that one-stage detectors can achieve state-of-the-art accuracy comparable to two-stage detectors in image-based detection. It can be seen as a hard-mining improvement of binary cross entropy that helps the network focus on hard examples so that they are not overwhelmed by a large number of easily classified examples.
Similar to image-based detection methods, point-cloud-based detection methods can be classified into two-stage [2, 3, 4] and one-stage detectors [5, 6]. In this paper, inspired by Lin et al. [1], we try to solve the fore-background imbalance in 3D object detection. We claim the following contributions:
We extend focal loss to 3D object detection to solve the huge fore-background imbalance in one-stage detectors, and conduct experiments on two different one-stage 3D object detectors, 3DFCN and VoxelNet (Table I). The experiment results demonstrate up to 11.2 AP gains from focal loss in a wide range of hyperparameters.
To further understand focal loss in 3D object detection, we analyze its effect on foreground and background estimations in both 3DFCN and VoxelNet. We validate that it plays a role similar to the one it plays in image-based detection, and find that the special architecture of VoxelNet naturally handles hard negatives well.
We plot the final confidence distributions of the two detectors and demonstrate that focal loss with an increasing hyperparameter $\gamma$ decreases the estimation confidence.
When extending two-stage image detectors to 3D, the following problems appear: the input is sparse and at low resolution, and the original method is not guaranteed to have enough information to generate region proposals, especially for small object classes. Ku et al. designed AVOD, which fuses RGB images and point clouds [3]. It first proposes aligned 3D bounding boxes with a multimodal fusion region proposal network, and then classifies and regresses the proposed bounding boxes with fully connected layers. Both appearance and 3D information are well utilized to improve accuracy and robustness in extreme scenes. However, their handcrafted input features are suboptimal for minimizing the loss function and could be further improved.
Qi et al. leveraged both 2D object detectors and 3D deep learning for object localization [2]. They extract the 3D bounding frustum of an object with a 2D object detector. Then 3D instance segmentation and 3D bounding box regression are applied with two variants of PointNet [7]. F-PointNet achieves state-of-the-art accuracy on the KITTI challenge while also running at real-time speed in 3D detection tasks. Its image detector must be carefully designed for a high recall rate, since the accuracy upper bound is determined by the first stage.

Li et al. extended the 2D fully convolutional network to 3D (3DFCN) [5]. The voxelized point clouds are processed by an encoder-decoder network. The 3D fully convolutional network finally proposes a probability map and a regression map for the whole detection region. It consists entirely of 3D dense convolutions with high computation and memory costs, so the network is shallow and abstract features can hardly be extracted. Unlike AVOD, which adopts handcrafted features to represent the point clouds, Zhou et al. designed an end-to-end network that implements point-cloud-based 3D object detection with learned representations (VoxelNet) [6]. Compared to 3DFCN [5], the computation cost is mitigated by the Voxel Feature Encoding Layers (VFE-Layers) and 2D convolution.

In this paper, we adopt 3DFCN and VoxelNet as two different types of one-stage 3D detectors. As shown in Table I, 3DFCN consumes dense grids and consists of only 3D dense convolution layers, where the 2D FCN architecture is extended to 3D for dense feature extraction. In contrast, VoxelNet consumes sparse point lists and is a heterogeneous network, which first extracts sparse features with its novel VFE-Layers and then conducts 3D and 2D convolution sequentially.




Method  2D Detectors  3DFCN [5]  VoxelNet [6]
Dimension  2D  3D  3D
Input  Dense Grid  Dense Grid  Sparse Point List
Network  Dense Conv  Dense Conv  Heterogeneous
Pipeline  One/Two-Stage  One-Stage  One-Stage
Image-based object detectors can be classified into two-stage and one-stage detectors. For two-stage detectors like R-CNN [8], the first stage generates a sparse set of candidate object locations, and the second stage classifies each candidate location as one of the foreground classes or as background using a CNN. The two-stage detectors [9, 10] achieve state-of-the-art accuracy on the COCO benchmark. The one-stage detectors, on the other hand, aim to simplify the pipeline, like YOLO [11] and SSD [12]. They improve the speed of the network and also demonstrate promising results in terms of accuracy.
Lin et al. explored both one-stage and two-stage detectors in image-based object detection, and claimed that the hurdle obstructing one-stage detectors from better accuracy is the extreme fore-background class imbalance encountered during the training of dense detectors [1]. They reshaped the standard cross entropy loss and proposed focal loss, such that the losses assigned to well-classified examples are down-weighted. It can be seen as a hard-mining improvement of binary cross entropy that helps the network focus on hard examples so that they are not overwhelmed by a large number of easily classified examples.
We extend focal loss to the task of 3D object detection to solve the problem of fore-background imbalance. Different from image-based detection, point-cloud-based object detection is a more challenging perception problem in 3D space with sparse sensor data, and suffers from a more serious fore-background imbalance. To thoroughly evaluate the performance of focal loss in this harder task, we conduct experiments based on two different types of one-stage 3D detectors: 3DFCN and VoxelNet. We analyze the effect of focal loss on these two 3D detectors following a method similar to [1], and further discuss the decreasing-confidence effect of focal loss.
In this section, we first declare notations and revisit focal loss, and then analyze the fore-background imbalance in 3D object detection.
We define $y \in \{-1, 1\}$ as the ground-truth class, and $p \in [0, 1]$ as the estimated probability for the class with label $y = 1$. For notational convenience, we define $p_t$:

$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$  (1)

The binary cross entropy loss (BCE loss) and its derivative w.r.t. $x$ (where $p$ is calculated from $x$ via the sigmoid function, $p = \sigma(x)$) can be formulated as

$\mathrm{CE}(p_t) = -\log(p_t)$  (2)

$\frac{\partial \mathrm{CE}}{\partial x} = y\,(p_t - 1)$  (3)
As claimed in [1], when the network is trained with BCE loss, its gradient will be dominated by the vast number of easy negative samples if a huge fore-background imbalance exists. Focal loss can be considered as a dynamically scaled cross entropy loss, which is defined as
$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log(p_t)$  (4)

$\frac{\partial \mathrm{FL}}{\partial x} = y\,(1 - p_t)^{\gamma}\left(\gamma\,p_t\log(p_t) + p_t - 1\right)$  (5)
The loss contribution of well-classified samples ($p_t > 0.5$) will be down-weighted. The hyperparameter $\gamma$ of focal loss can be used to tune its effect. As $\gamma$ increases, fewer easily classified samples contribute to the loss. When $\gamma$ reaches $0$, focal loss degrades into BCE loss (Figure 1). In the following sections, all cases with $\gamma = 0$ represent BCE loss cases.
Researchers previously either introduced hyperparameters to balance the losses calculated from positive and negative anchors, or normalized the positive and negative losses by the frequency of the corresponding anchors, such that all sub-losses can be balanced into the same order of magnitude. However, one important problem these two previous methods cannot handle is the gradient salience of hard negative samples: the gradients of hard negative anchors ($p_t \le 0.5$) are overwhelmed by a large number of easy negative anchors ($p_t > 0.5$). Due to the dynamic scaling with confidence $p_t$, a weighted focal loss can be used to handle both the fore-background imbalance and the gradient salience of hard negative samples, with the following form.
$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma}\log(p_t)$  (6)
where $\alpha_t$ is introduced to weight different classes. In the following sections, we adopt the weighted focal loss form, with separate hyperparameter settings for the positive and negative focal loss terms.
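As a sanity check on Eqs. 1-6, here is a minimal NumPy sketch of the weighted focal loss; the function and variable names are ours for illustration, not from the released code.

```python
# Weighted focal loss (Eq. 6); gamma = 0 recovers (weighted) BCE (Eq. 2).
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=1.0, eps=1e-7):
    """Per-sample focal loss.

    p     : predicted probability for the positive class, in (0, 1)
    y     : ground-truth label in {+1, -1}
    gamma : focusing parameter of Eq. 4
    alpha : class-balancing weight (alpha_t in Eq. 6)
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)                   # Eq. 1
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)   # Eq. 6
```

With `gamma=0.0` the scaling factor $(1 - p_t)^{\gamma}$ becomes 1 and the function reduces to binary cross entropy, matching the degradation described above.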
The methods for 3D object detection can be classified into two-stage [2, 3, 4] and one-stage detectors [5, 6]. The two-stage detectors first adopt an algorithm with a high recall rate to propose regions that possibly contain objects, and then adopt a convolutional network to classify objects and regress bounding boxes. The one-stage detectors are end-to-end networks that learn representations and perform classification and regression over all anchors.
In one-stage methods, anchors are proposed at each location, so a huge fore-background imbalance exists. For instance, about 50k bounding boxes are proposed in each frame for 3DFCN and about 70k for VoxelNet, but fewer than 30 anchors among them contain positive objects (i.e. car, pedestrian, cyclist). In contrast, the first-stage proposal can help alleviate the fore-background imbalance in two-stage methods, since it only proposes hundreds of bounding boxes with a high recall rate. The one-stage methods for 3D detection differ from their 2D counterparts because of the larger searching space, different network architectures and sparse input. Therefore, we select two different networks, 3DFCN and VoxelNet, to evaluate the performance of focal loss in 3D object detection. The features of these two 3D detectors are discussed in the following two sections, and the experiment details and results are shown in Sec. VI.
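A back-of-the-envelope check makes the imbalance concrete. The anchor counts below match the text; the confidences are made-up illustrative values, not measurements.

```python
# How easy negatives dominate the total loss under BCE, and how the
# (1 - p_t)^gamma factor of focal loss suppresses them.
import numpy as np

def bce(p_t):
    return -np.log(p_t)

def focal(p_t, gamma=2.0):
    return (1.0 - p_t) ** gamma * -np.log(p_t)

n_neg, n_pos = 70_000, 30    # anchors per frame (VoxelNet) vs. positives
pt_neg, pt_pos = 0.99, 0.6   # illustrative: easy negatives vs. hard positives

# ratio of total negative loss to total positive loss
bce_ratio = n_neg * bce(pt_neg) / (n_pos * bce(pt_pos))
fl_ratio = n_neg * focal(pt_neg) / (n_pos * focal(pt_pos))
# Under BCE the 70k easy negatives outweigh the positives; under focal
# loss with gamma = 2 their contribution is suppressed below the positives'.
```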
In this section, we describe the dense convolutional network architecture of 3DFCN and introduce our enhanced loss function for it. The details of 3DFCN can be found in [5], and our implementation of 3DFCN can be found in the APPENDIX.
3DFCN [5] draws on the experience of image-based recognition tasks and extends the 2D convolution layer to 3D space to acquire translation invariance. The input point cloud is first voxelized into a 3D dense grid, where the value of each voxel indicates whether any point is observed inside it. The network architecture of 3DFCN is shown in Figure 2. The voxelized point cloud is convolved by four Conv3D blocks sequentially. The output features are then separately processed by two Conv3D blocks to generate a probability map and a regression map (PMap and RMap). Different from image-based object detection, the probability map and regression map are 3D dense grids, so the searching space is exponentially larger.
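The binary occupancy voxelization described above can be sketched as follows; the grid extents and resolution are illustrative placeholders, not the paper's settings.

```python
# Binary occupancy voxelization: each voxel stores whether any point
# falls inside it.
import numpy as np

def voxelize(points, origin, voxel_size, grid_shape):
    """points: (N, 3) xyz array -> binary occupancy grid of grid_shape."""
    idx = np.floor((points - origin) / voxel_size).astype(int)
    # keep only points that fall inside the grid bounds
    keep = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=np.float32)
    grid[tuple(idx[keep].T)] = 1.0
    return grid
```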
The original loss function for 3DFCN is a simple version [5], where only the classification and regression losses are balanced. We adopt the loss used in [6], which normalizes each sub-loss by its corresponding frequency and introduces hyperparameters $\alpha$, $\beta$, and $\lambda$ to balance them, so that the positive and negative classification losses, as well as the classification and regression losses, are of the same order of magnitude. The loss function has the following form.
$L = \alpha L_{cls}^{pos} + \beta L_{cls}^{neg} + \lambda L_{reg}$  (7)

$L_{cls}^{pos} = \frac{1}{N_{pos}} \sum_i f(p_i^{pos}, 1)$  (8)

$L_{cls}^{neg} = \frac{1}{N_{neg}} \sum_j f(p_j^{neg}, 0)$  (9)

$L_{reg} = \frac{1}{N_{pos}} \sum_i d(u_i, u_i^{*})$  (10)

$d(u_i, u_i^{*}) = \lVert u_i - u_i^{*} \rVert^2$  (11)

where $L_{cls}^{pos}$, $L_{cls}^{neg}$ and $L_{reg}$ represent the positive classification, negative classification and regression losses, while $N_{pos}$ and $N_{neg}$ represent the numbers of positive and negative voxels respectively. In the regression loss $L_{reg}$, $u_i$ and $u_i^{*}$ are the regression output and ground truth for positive anchors, while $d$ denotes the square loss. In the classification losses, $f$ refers to the binary cross entropy of Eq. 2 or the focal loss of Eq. 6, while $p_i^{pos}$ and $p_j^{neg}$ represent the confidence of positive and negative estimations respectively.
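The frequency-normalized, balanced loss above can be sketched as follows, with plain BCE as the classification term $f$ and placeholder values for $\alpha$, $\beta$, $\lambda$ (the tuned values differ per network and are not reproduced here).

```python
# Balanced loss: positive/negative classification terms are averaged over
# their own anchor counts, then weighted by alpha, beta, lambda.
import numpy as np

def balanced_loss(p_pos, p_neg, u, u_gt, alpha=1.0, beta=1.0, lam=1.0, eps=1e-7):
    """p_pos: (N_pos,) confidences at positive anchors;
    p_neg: (N_neg,) confidences at negative anchors;
    u, u_gt: (N_pos, 24) regression outputs and targets."""
    p_pos = np.clip(p_pos, eps, 1 - eps)
    p_neg = np.clip(p_neg, eps, 1 - eps)
    l_pos = -np.log(p_pos).mean()          # (1/N_pos) * sum f(p_i, 1)
    l_neg = -np.log(1 - p_neg).mean()      # (1/N_neg) * sum f(p_j, 0)
    l_reg = ((u - u_gt) ** 2).sum(axis=1).mean()  # square loss, positives only
    return alpha * l_pos + beta * l_neg + lam * l_reg
```

Swapping the `-np.log(...)` terms for the focal loss of Eq. 6 gives the focal variant used in the experiments.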
In this section, we describe the heterogeneous network architecture of VoxelNet and its bird's-eye-view estimation. The details of VoxelNet can be found in [6], and our implementation of VoxelNet can be found in the APPENDIX.

An overview of the heterogeneous architecture of VoxelNet is shown in Figure 3. It consists of three main parts: FeatureNet (point-wise and voxel-wise feature transformation), MiddleLayer (3D dense convolution) and RPN (2D dense convolution).
FeatureNet extracts features directly from sparse point lists. It adopts Voxel Feature Encoding Layers (VFE-Layers) [6] to extract both point-wise and voxel-wise features directly from points, where fully connected layers extract point-wise features and a symmetric function aggregates local features from all points within a voxel. Compared to suboptimally deriving handcrafted features from voxels (e.g. a binary value representing non-empty voxels), VFE-Layers are able to learn the representations that best minimize the loss function.
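A minimal NumPy sketch of a VFE-style layer for a single voxel: a shared point-wise linear transform, a max-pooling (symmetric) aggregation, and the pooled feature concatenated back to each point. The weights here are placeholders; the real layer is trained end to end.

```python
# One VFE-style layer for the points of a single voxel.
import numpy as np

def vfe_layer(points, w, b):
    """points: (N, C_in) points of ONE voxel -> (N, 2 * C_out) features."""
    pointwise = np.maximum(points @ w + b, 0.0)       # shared FC + ReLU
    voxelwise = pointwise.max(axis=0, keepdims=True)  # symmetric aggregation
    # concatenate the locally aggregated feature onto every point feature
    return np.concatenate(
        [pointwise, np.repeat(voxelwise, len(points), axis=0)], axis=1)
```

Because the aggregation is a max over points, the voxel-wise feature is invariant to the ordering of the input points, which is what makes it suitable for unordered point lists.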
The voxel-wise representations derived from the VFE-Layers are sparse. The sparse representation saves memory and computation costs: if a point cloud of the KITTI dataset is partitioned into a dense grid for vehicle detection, only around 5300 voxels (about 0.3% of the grid) are non-empty. However, the sparse representation is currently unfriendly to convolution operations. In order to implement convolution, VoxelNet compromises some efficiency and converts the sparse representation to a dense one: each sparse voxel-wise representation is copied to its specific entry in the dense grid.
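The sparse-to-dense conversion described above amounts to a scatter of feature vectors into a dense 4D grid, which can be sketched as:

```python
# Scatter sparse voxel-wise features into a dense (C, D, H, W) grid.
import numpy as np

def scatter_to_dense(features, coords, grid_shape):
    """features: (V, C) voxel features; coords: (V, 3) integer (d, h, w)
    indices of the non-empty voxels."""
    dense = np.zeros((features.shape[1],) + tuple(grid_shape),
                     dtype=features.dtype)
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = features.T
    return dense
```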
MiddleLayer consumes the 3D dense grid and converts it to a 2D bird's-eye-view form, so that further processing can be done in 2D space. The role of MiddleLayer is to learn features from all voxels at the same bird's-eye-view location. If we denote the dense grid in the order of $(D, H, W)$, each 3D convolution kernel is of size $3 \times 3 \times 3$ with stride $(2, 1, 1)$ or $(1, 1, 1)$ (Table VI). The stride of 2 along the z-axis aggregates voxel-wise features within a progressively expanding receptive field along that axis, while the stride of 1 keeps the shape in the $H$ and $W$ dimensions.
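The shape bookkeeping for these three convolutions can be traced explicitly. The input grid size is illustrative and padding of 1 is assumed; the strides match Table VI.

```python
# Output shape of a 3x3x3 conv with given per-axis strides (padding 1).
def conv3d_out(shape, stride, kernel=3, pad=1):
    return tuple((s + 2 * pad - kernel) // st + 1
                 for s, st in zip(shape, stride))

shape = (10, 400, 352)  # illustrative (D, H, W) input grid
for stride in [(2, 1, 1), (1, 1, 1), (2, 1, 1)]:
    shape = conv3d_out(shape, stride)
# D shrinks 10 -> 5 -> 5 -> 3 while H and W are preserved, so the D and
# channel axes can be merged into a 2D bird's-eye-view feature map.
```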
RPN predicts the probability and regression maps from the 2D bird's-eye-view feature map. It does not utilize max-pooling and adopts skip layers [13] to combine high-level semantic features and low-level spatial features. We interpret this design as follows: even though strides and max-pooling provide deep convolutional neural networks with spatial invariance, the increased invariance and large receptive fields of top-level nodes yield smooth responses which cause inaccurate localization.
The final probability and regression estimation maps are in bird's-eye-view form, similar to the final estimation of image-based detection methods. This saves both memory and computation compared to 3D maps, but it can only estimate one object per location in bird's eye view. This is acceptable in autonomous driving scenes but will meet problems in indoor scenes where objects can be stacked up (e.g. a mug on a book).
MiddleLayer saves computation for further processing by aggregating the 3D dense grid into a 2D bird's-eye-view feature map. Otherwise, using 3D dense convolution throughout such a deep network (22 convolution layers) would bring exponentially more parameters and computation. We note that MiddleLayer is still a bottleneck of the whole network, as shown by the GFLOPs in Table VI, because of its 3D dense convolution operations. An efficient sparse convolution implementation is still an open problem and deserves effort.
γ  Bird's Eye View AP (%)  3D Detection AP (%)
   Easy  Mod  Hard  Easy  Mod  Hard
0  32.11  31.67  27.78  24.22  21.96  18.63 
0.1  37.53  35.15  30.61  28.24  24.73  24.80 
0.2  38.10  35.32  30.65  27.75  23.88  20.36 
0.5  33.59  32.61  28.59  24.76  22.34  19.04 
1  42.91  38.21  32.96  32.26  26.70  22.58 
2  43.32  38.45  33.09  32.91  27.23  22.81 
5  25.18  24.38  20.62  18.77  16.47  17.27 
γ  Bird's Eye View AP (%)  3D Detection AP (%)
   Easy  Mod  Hard  Easy  Mod  Hard
0  85.26  61.35  60.97  70.54  55.11  48.79 
0.1  85.93  69.65  68.83  72.67  56.31  56.11 
0.2  82.55  60.42  60.23  72.66  56.67  50.41 
0.5  86.80  69.40  61.79  75.86  58.28  57.92 
1  87.28  70.46  61.93  74.16  57.01  56.20 
2  84.48  68.76  61.04  70.82  55.25  54.67 
5  80.48  62.56  53.76  75.04  50.85  50.53 
In this section, we intend to answer two questions: 1) Can focal loss help improve accuracy in the 3D object detection task? 2) Does focal loss have the same effect in 3D object detection as in 2D? To answer 1), we conduct experiments to compare the performance of 3DFCN and VoxelNet trained with BCE loss and focal loss on the challenging KITTI benchmark [14]. To answer 2), we analyze the cumulative distribution curves of 3DFCN and VoxelNet following a method similar to [1].
The KITTI 3D object detection dataset contains 3D annotations for cars, pedestrians and cyclists in urban driving scenarios. The sensor setup mainly consists of a wide-angle camera and a Velodyne LiDAR (HDL-64E), both of which are well calibrated. The training dataset contains 7481 frames with raw sensor data and annotations. We follow [4] and split the dataset into training and validation sets, each containing around half of the entire set. For simplicity, we conduct experiments only on the car class, since both 3DFCN and VoxelNet are trained class-specifically and extending to other classes is straightforward, mostly a matter of tuning. Besides, focal loss is agnostic to the object class in terms of Eq. 6.
The network details of both 3DFCN and VoxelNet are shown in Tables V and VI in the APPENDIX. We tune $\alpha$, $\beta$ and $\lambda$ so that the positive and negative classification losses, as well as the classification and regression losses, are of the same order of magnitude, with separate settings for 3DFCN and VoxelNet. The KITTI 3D detection dataset contains some noisy annotations, i.e. bounding boxes containing almost no points. In order to avoid overfitting these, we remove all ground-truth bounding boxes containing few points (fewer than 10).
As claimed in [1], a network trained with focal loss from scratch is unstable at the beginning. Therefore, in order to stabilize training, we first train each network (both 3DFCN and VoxelNet) for 30 epochs with BCE loss, and then continue training with focal loss for another 30 epochs with a specific $\gamma$; the learning rates are listed in Table IV. We compare the results of the last epoch in Tables II and III, where the rows with $\gamma = 0$ represent BCE loss results and the rows with $\gamma > 0$ represent focal loss results, while bolded numbers mark the focal loss cases that outperform the BCE loss case.

In general, VoxelNet outperforms 3DFCN in accuracy, since VoxelNet consumes the original point clouds while 3DFCN voxelizes them with information loss. Besides, VoxelNet has a deeper network, and experience from image-based recognition tasks shows that deeper networks are able to extract more useful high-level features. In 3DFCN, focal loss helps improve accuracy on all metrics over a wide range of hyperparameters ($0.1 \le \gamma \le 2$), providing gains from 0.5 AP to 11.2 AP. In VoxelNet, the cases with $\gamma \in \{0.1, 0.5, 1\}$ show gains from focal loss on all metrics, ranging from 0.6 AP to 9.1 AP. For the remaining values of $\gamma$, both gains and degradations occur, but the best result is still trained with focal loss, and the gains generally far outweigh the degradations: the degradations are up to 2.7 AP, while the gains are up to 9.1 AP. Therefore, in 3D object detection, focal loss can help improve accuracy over a wide range of $\gamma$ (normally $0 < \gamma \le 2$), which differs from network to network.
Detector  γ  lr  Step  Bird's Eye View AP (%)  3D Detection AP (%)
                       Easy  Mod  Hard  Easy  Mod  Hard
3DFCN  0  1e-2  126k  51.33  45.82  40.24  40.01  33.12  28.94
3DFCN  2  1e-2  137k  53.19  48.03  41.96  46.05  35.93  31.01
VoxelNet  0  1e-4  155k  85.80  69.04  61.32  75.52  57.84  57.23
VoxelNet  0.2  1e-4  215k  86.89  69.33  61.63  80.08  58.39  57.60
From this table, we can draw the same conclusion: focal loss helps improve accuracy in 3D object detection. Note that all cases in this table are the best results among all intermediate weights, so the accuracy improvement comes from focal loss rather than longer training.
We analyze the empirical cumulative distributions of the loss of the converged 3DFCN and VoxelNet, following [1]. We evaluate all intermediate weights and select the best models, as detailed in Table IV. We apply the two converged models trained with focal loss (rows 2 and 4 in Table IV) to the validation dataset and sample the predicted probabilities for negative and positive windows. Then, we calculate the focal loss from these probabilities. The calculated losses are normalized to sum to one and sorted from low to high. We plot the cumulative distributions for 3DFCN and VoxelNet for different $\gamma$.
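The normalize-sort-accumulate procedure above can be sketched in a few lines:

```python
# Empirical CDF of per-sample losses: normalize to sum to one,
# sort ascending, then cumulatively sum.
import numpy as np

def loss_cdf(losses):
    losses = np.sort(np.asarray(losses, dtype=np.float64))
    return np.cumsum(losses) / losses.sum()
```

With a heavy-tailed loss distribution, the curve stays flat for most samples and rises sharply at the end, i.e. most of the loss mass sits in the top fraction of hard samples.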
The cumulative distributions for different $\gamma$ of 3DFCN and VoxelNet are shown in Figure 4. In 3DFCN, approximately 15% of the hardest positive samples account for roughly half of the positive loss. As $\gamma$ increases, more of the loss gets concentrated in the top 15% of examples. However, compared to the effect of focal loss on negative samples, its effect on positive samples is minor. For $\gamma = 0$, the positive and negative CDFs are quite similar. As $\gamma$ increases, substantially more weight becomes concentrated on the hard negative examples. With $\gamma = 2$ (the best result for 3DFCN), the vast majority of the loss comes from a small fraction of samples. As claimed in [1], focal loss can effectively discount the effect of easy negatives, so that the network focuses on learning the hard negative examples.
In VoxelNet, the situation is different. From the bottom row of Figure 4, we can see that the effect of focal loss increases on both positive and negative samples as $\gamma$ increases. But the cumulative distribution functions for negative samples are quite similar among different values of $\gamma$, even when the x-axis is rescaled. This shows that VoxelNet trained with binary cross entropy is already able to handle hard negative samples. Compared with the effect on negative samples, the effect of focal loss on positive samples is stronger. Therefore, the accuracy gains of focal loss in VoxelNet come mainly from hard positive samples.

From the analysis of the cumulative distributions, we believe that in 3D object detection, focal loss protects the network from the fore-background imbalance and helps alleviate the gradient salience of hard samples in the training process.
During the experiments, we found that a network trained with focal loss should be given a lower confidence threshold for non-maximum suppression. This inspired us to explore the effect of focal loss on output confidence. We take the models in Tables II and III, evaluate them on the validation set, record all evaluation results and plot the histograms of positive bounding box probabilities. The results are shown in Figure 5. As $\gamma$ increases, the peak decreases and moves to the left. This demonstrates that a network trained with focal loss outputs positive estimations with less confidence. It can be understood as follows: objects estimated with high confidence are easily classified objects, and the loss they contribute is down-weighted by focal loss during training. In other words, they are relatively ignored in the training process once they are estimated with high confidence, so their confidence cannot be further improved. But they can still be accurately detected if we decrease the confidence threshold of the non-maximum suppression in the final output step.
In this paper, we extended the focal loss of image detectors to 3D object detection to solve the foreground-background imbalance. We adopted two different types of 3D object detectors to demonstrate the performance of focal loss in point-cloud-based object detection. The experiment results show that focal loss helps improve accuracy in 3D object detection, and that it protects the network from the gradient salience of hard samples for both positive and negative anchors during training. The confidence histograms of models trained with focal loss show that such models output positive estimations with less confidence.
In this appendix, we describe our implementation details for 3DFCN and VoxelNet.
The network details of 3DFCN are shown in Table V. Each Conv3D block in the BodyNet applies 3D convolution, ReLU and batch normalization sequentially. In the HeadNet, each Conv3D block applies only 3D convolution. The sigmoid function is applied to obtain the posterior probability. Finally, non-maximum suppression is applied to output the most confident box among all overlapping bounding boxes.
In the training phase, we create the ground truth for the PMap by setting the object voxels containing an object center to 1 and all other voxels to 0. For the regression map, we set the object voxels to 24-length residual vectors and all other voxels to 24-length zero vectors. The 24-length residual vectors are the coordinates of the 8 corner points of the bounding box in the fixed order of [5].

Our implementation of the 3DFCN baseline is shown in row 1 of Table IV. It is not as good as claimed in [5], for the following reasons: 1) we eliminated all ground-truth bounding boxes containing few points (fewer than 10), so some positive samples cannot be recognized; 2) we trained the network for only 60 epochs; 3) we simplified the network architecture by removing the deconvolution layers to reduce memory costs. Even though these are not state-of-the-art results, they are sufficient to demonstrate that focal loss helps improve accuracy in 3D object detection.
The network details of VoxelNet are shown in Table VI. The FC block in VoxelNet consists of a linear fully connected layer, a batch normalization layer and a non-linearity (ReLU) applied sequentially. Each Conv3D block in the MiddleLayer applies 3D convolution, ReLU and batch normalization, while each Conv2D block in the RPN applies 2D convolution, ReLU and batch normalization sequentially. The PMap and RMap consist of only 2D convolution layers without ReLU or batch normalization. The sigmoid function is applied to obtain the posterior probability, and non-maximum suppression is applied at the end.
We adopt the original parameterization method of VoxelNet. A 3D bounding box is parameterized as $(x, y, z, l, w, h, \theta)$, where $(x, y, z)$ represents the center location, $(l, w, h)$ are the length, width and height of the box, and $\theta$ is the yaw rotation around the z-axis.
For regression, the residual vector between the ground truth and an anchor is denoted as

$\Delta x = \frac{x^g - x^a}{d^a},\quad \Delta y = \frac{y^g - y^a}{d^a},\quad \Delta z = \frac{z^g - z^a}{h^a},\quad \Delta l = \log\frac{l^g}{l^a},\quad \Delta w = \log\frac{w^g}{w^a},\quad \Delta h = \log\frac{h^g}{h^a},\quad \Delta\theta = \theta^g - \theta^a$

where $d^a = \sqrt{(l^a)^2 + (w^a)^2}$ is the diagonal of the base of the anchor box, and superscript $g$ denotes the ground truth while $a$ denotes the anchor box. The loss function for VoxelNet is similar to that of 3DFCN, but uses the SmoothL1 loss for regression as in [6].
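The residual encoding above translates directly into code; this is a sketch of the standard VoxelNet-style encoding, with illustrative function and variable names.

```python
# Encode the regression target between a ground-truth box and an anchor,
# both given as (x, y, z, l, w, h, theta).
import numpy as np

def encode_residual(gt, anchor):
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    da = np.sqrt(la ** 2 + wa ** 2)  # diagonal of the anchor's base
    return np.array([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,  # normalized offsets
        np.log(lg / la), np.log(wg / wa), np.log(hg / ha),  # log-scale sizes
        tg - ta,                                            # yaw residual
    ])
```

Normalizing the center offsets by the anchor diagonal (and height) keeps the targets roughly scale-free, so one regression head can serve anchors of different sizes.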
Our implementation results are not as good as claimed in [6], for the following reasons: 1) we used a small batch size of 1; Zhou et al. mentioned that their batch size was 16, but a normal GPU with 12 GB of memory can only support a batch size of 2; 2) we trained the network for only 60 epochs while they trained VoxelNet for 120 epochs. Even though these are not state-of-the-art results, they are sufficient to demonstrate that focal loss helps improve accuracy in 3D object detection. Our released code and weights can help researchers easily reproduce our results.
Block Name  Layer Name  Kernel Size  Strides  Filter  GFLOPs 
Body  conv3d_1  [5,5,5]  [2,2,2]  32  25.8 
conv3d_2  [5,5,5]  [2,2,2]  64  204.9  
conv3d_3  [3,3,3]  [2,2,2]  96  16.6  
conv3d_4  [3,3,3]  [1,1,1]  96  24.9  
HeadPMap  conv3d_obj  [3,3,3]  [1,1,1]  1  0.3 
HeadRMap  conv3d_cor  [3,3,3]  [1,1,1]  24  6.2 
Block Name  Layer Name  Kernel Size  Strides  Filter  GFLOPs
FeatureNet  vfe_1  32  N/A  N/A  <0.1  
vfe_2  128  N/A  N/A  <0.1  
fc_1  128  N/A  N/A  <0.1  
bn_1  N/A  N/A  N/A  /  
relu  N/A  N/A  N/A  /  
MiddleLayer  conv3d_1  [3,3,3]  [2,1,1]  64  311.5  
conv3d_2  [3,3,3]  [1,1,1]  64  93.5  
conv3d_3  [3,3,3]  [2,1,1]  64  62.3  
reshape  N/A  N/A  N/A  /  
RPN  conv2d_4  [3,3]  [2,2]  128  10.4  
conv2d_5  [3,3]  [1,1]  128  10.4  
conv2d_6  [3,3]  [1,1]  128  10.4  
conv2d_7  [3,3]  [1,1]  128  10.4  
deconv_1  [3,3]  [1,1]  256  20.8  
conv2d_8  [3,3]  [2,2]  128  10.4  
conv2d_9  [3,3]  [1,1]  128  2.6  
conv2d_10  [3,3]  [1,1]  128  2.6  
conv2d_11  [3,3]  [1,1]  128  2.6  
conv2d_12  [3,3]  [1,1]  128  2.6  
conv2d_13  [3,3]  [1,1]  128  23.6  
deconv_2  [2,2]  [2,2]  256  5.2  
conv2d_14  [3,3]  [2,2]  256  5.2  
conv2d_15  [3,3]  [1,1]  256  2.6  
conv2d_16  [3,3]  [1,1]  256  2.6  
conv2d_17  [3,3]  [1,1]  256  2.6  
conv2d_18  [3,3]  [1,1]  256  2.6  
conv2d_19  [3,3]  [1,1]  256  2.6  
deconv_3  [4,4]  [4,4]  256  2.6  
PMap  conv2d_obj  [1,1]  [1,1]  2  0.1  
RMap  conv2d_cor  [1,1]  [1,1]  14  0.8 