IoU Loss for 2D/3D Object Detection

08/11/2019, by Dingfu Zhou, et al.

In the 2D/3D object detection task, Intersection-over-Union (IoU) has been widely employed as an evaluation metric to evaluate the performance of different detectors in the testing stage. However, during the training stage, a common distance loss (e.g., L_1 or L_2) is usually adopted as the loss function to minimize the discrepancy between the predicted and ground truth Bounding Box (Bbox). To eliminate the performance gap between training and testing, the IoU loss has been introduced for 2D object detection in [27] and [20]. Unfortunately, all these approaches only work for axis-aligned 2D Bboxes and cannot be applied to more general object detection tasks with rotated Bboxes. To resolve this issue, we first investigate the IoU computation for two rotated Bboxes and then implement a unified IoU loss layer for both 2D and 3D object detection tasks. By integrating the implemented IoU loss into several state-of-the-art 3D object detectors, consistent improvements have been achieved for both bird-eye-view 2D detection and point cloud 3D detection on the public KITTI benchmark.

1 Introduction

Object detection, as a fundamental task in computer vision and robotics, has been well studied recently. For 2D object detection, many classical frameworks have been developed, including both two-stage methods (e.g., Fast R-CNN [9], Faster R-CNN [19]) and one-stage methods (e.g., SSD [15] and YOLO [18]). Recently, with the rapid development of range sensors, such as LiDAR and RGB-D cameras, 3D object detection has been attracting more and more researchers' attention. Similar to 2D detection, several one- and two-stage 3D object detection frameworks have been developed, such as Frustum-PointNet [16], VoxelNet [28], SECOND [25], PointPillars [12] and Point R-CNN [21].

Figure 1: An example of 3D car detection results from models trained with SECOND [25] (left column) and SECOND + the proposed loss (right column). The IoU value of each Bbox is given in the bird-eye-view image at the bottom of the figure. From this figure, we can see that the accuracy of the bounding boxes is steadily improved by the proposed IoU loss.

For easy generalization, objects in the detection task are usually represented as 2D Bboxes or 3D cuboids with several parameters, such as the Bbox's center, dimension and orientation. Therefore, the object detection problem is transformed into a regression task that minimizes the difference between the ground truth Bbox and the predicted one. Currently, powered by deep neural networks, most approaches focus on designing a better backbone architecture [28] or a better representation to extract information about the foreground and background objects. For the loss function, they employ the commonly used L_1 or L_2 distance to optimize the whole network.

To compare the performance of different detectors, the IoU metric is usually employed for evaluation, which is a totally different metric from the L_1 and L_2 losses. As the name suggests, IoU (Intersection over Union) is the ratio of the area of the intersection to the area of the union of two shapes, e.g., Bboxes. Compared with the L_1 and L_2 distances, the IoU metric has several advantages. First, all shape properties of the Bbox are considered in the IoU computation, e.g., location, dimension and orientation. Second, the area computation implicitly encodes the relationship between the parameters rather than treating them as independent variables, as the L_1 and L_2 losses do. Finally, the IoU metric is scale invariant, which helps to handle the scale and range differences among the parameters.

Through the analysis above, we can clearly see that there is an obvious mismatch between the objective used for model training and the metric used for evaluation. In fact, there is no strong correlation between the distance losses and the IoU metric: two predicted Bboxes may have the same distance loss with respect to the ground truth Bbox, while their IoU values can be totally different.
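As a toy illustration of this mismatch (the numbers below are our own and are not taken from the paper), consider an elongated ground truth box in the BEV, e.g., a car footprint: two predictions with exactly the same L_1 error on the box corners can have very different IoU values.

    # Toy example: identical L_1 corner error, very different IoU (illustration only).
    def iou_axis_aligned(a, b):
        # boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    gt     = (0, 0, 10, 2)   # elongated box, e.g. a car footprint in the BEV
    pred_a = (0, 1, 10, 3)   # shifted by 1 along the short side, L_1 corner error = 2
    pred_b = (1, 0, 11, 2)   # shifted by 1 along the long side,  L_1 corner error = 2
    print(iou_axis_aligned(gt, pred_a))  # ~0.33
    print(iou_axis_aligned(gt, pred_b))  # ~0.82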

To eliminate this kind of gap, some efforts have been made in [27] and [20] for 2D object detection. Unfortunately, both of them are only suitable for the easy case of axis-aligned Bboxes, and neither can be applied to the general case of rotated Bboxes or to 3D object detection. In this paper, we first explore the IoU calculation between two rotated Bboxes and then implement a unified IoU loss function which can be used for both axis-aligned and rotated 2D object detection. In addition, the new IoU loss can also be applied to 3D object detection, where the orientation has only one degree of freedom. The main contributions of this paper can be summarized as:

  • We investigated the IoU loss computation for two rotated 2D and 3D Bboxes;

  • We provided a unified, framework-independent IoU loss layer for general 2D and 3D object detection tasks;

  • By integrating the IoU loss layer into several state-of-the-art 3D object detection frameworks, such as SECOND, PointPillars and Point R-CNN, its superiority has been verified on the public KITTI 3D object detection benchmark.

2 Related Works

2.1 2D object detection

Generic object detection frameworks can mainly be divided into two directions. The first direction, often called two-stage methods, generates region proposals in the first stage and then classifies each proposal in the second. The other direction is one-stage methods, which treat object detection as a joint regression and classification problem and adopt a unified framework to obtain location and class information simultaneously. R-CNN [8], Fast R-CNN [9], Faster R-CNN [19] and Mask R-CNN [10] are the most representative two-stage methods, while MultiBox [4], YOLO [18], SSD [15] and DSSD [6] are representative one-stage methods.

Although the design ideas of the one- and two-stage frameworks differ slightly, Bbox parameter regression is a crucial component of both. For robust optimization and better regression results, different Bbox representations and loss functions have been designed. In YOLO [18], the authors proposed to directly regress the Bbox parameters for object detection. To reduce scale sensitivity, they predict the square root of the bounding box size rather than the size itself. In R-CNN [8], the concept of prior Bboxes, also well known as proposals, was introduced. In this case, Bbox regression is transformed into predicting the residual between the ground truth Bbox and the proposal, and the L_2-norm is taken as the loss function for optimizing the framework. To be robust against outliers and noise, the smooth-L_1 norm was applied in Fast R-CNN [9]. Since then, the smooth-L_1 norm has been taken as a standard loss in object detection frameworks [19, 10, 15].
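For reference, a minimal sketch of the smooth-L_1 (Huber-like) loss applied to box-regression residuals is given below; the residual values are placeholders for illustration and the snippet is not tied to any specific detector's implementation.

    import torch

    def smooth_l1(x, beta=1.0):
        # quadratic for small residuals, linear for large ones (robust to outliers)
        abs_x = x.abs()
        return torch.where(abs_x < beta, 0.5 * abs_x ** 2 / beta, abs_x - 0.5 * beta)

    # residuals between predicted and target box parameters, e.g. (dx, dy, dw, dh)
    residuals = torch.tensor([0.2, -0.1, 0.05, 1.3])
    loss = smooth_l1(residuals).sum()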

2.2 3D object detection

3D object detection in traffic scenarios has become more and more popular with the development of range sensors and autonomous driving techniques. Inspired by 2D object detection, one line of work first projects the point cloud into 2D (e.g., bird-eye-view [2] or front-view [24]) to obtain 2D detections and then re-projects the 2D Bboxes into 3D to obtain the final results. Another representative direction is volumetric convolution-based methods, enabled by the rapid development of graphics processing resources. VoxelNet [28] is a pioneering work that detects 3D objects directly with 3D convolutions by representing the LiDAR point cloud with voxels. To save GPU memory, the voxel resolution is kept relatively large, and for each voxel PointNet [17] is applied to extract a 128-dimensional feature. Based on the VoxelNet framework, two variants, SECOND [25] and PointPillars [12], have been proposed. Different from the two directions mentioned above, PointNet [17] itself is another useful technique for point cloud feature extraction, and along this direction several state-of-the-art methods have been proposed for 3D object detection [16, 21]. Similar to the 2D object detection frameworks, the common smooth-L_1 norm has been employed directly for 3D Bbox regression.

2.3 IoU Loss for Object Detection

Most frameworks use a surrogate of IoU (e.g., the L_1 or L_2 distance loss) for Bbox regression. The drawbacks of this kind of loss function have been pointed out in [27, 22] and [20]. In [27], a novel IoU loss function for axis-aligned bounding box prediction was introduced, which regresses the four bounds of a predicted box as a whole unit, performs accurate and efficient localization, is robust to objects of varied shapes and scales, and converges fast. In [22], a bounded IoU loss was developed, which is shown to better match the goal of IoU maximization while still providing good convergence properties. Furthermore, in [20], the authors first discussed the weakness of IoU for the case of non-overlapping bounding boxes and then introduced a generalized version of IoU (GIoU) as a new loss. Finally, the effectiveness of GIoU was verified by integrating it into state-of-the-art 2D object detection frameworks [19, 18, 10]. All the works mentioned above target the axis-aligned Bbox regression task; none of them applies the IoU loss to rotated Bboxes or 3D object detection.

3 IoU for Object Detection

IoU is also known as the Jaccard index (or the Jaccard similarity coefficient), which has been widely used to measure the similarity between finite sample sets. Generally, for two finite sample sets A and B, their IoU is defined as the size of the intersection (A ∩ B) divided by the size of the union (A ∪ B) of A and B:

IoU(A, B) = |A ∩ B| / |A ∪ B|. (1)

As discussed in [20], IoU fulfills all properties of a metric, such as non-negativity, identity of indiscernibles, symmetry and the triangle inequality. In particular, IoU is scale invariant, which means that the similarity between two arbitrary shapes A and B is independent of the scale of their space. Due to these properties, IoU has been widely employed as an evaluation metric for many tasks in computer vision, e.g., pixel- or instance-level image segmentation and 2D/3D object detection. Here we focus only on the task of object detection; its application to other tasks is beyond the scope of this paper.
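As a one-line illustration of Eq. (1) on finite sets (our own toy example):

    # Jaccard index / IoU of two finite sets, following Eq. (1)
    A = {1, 2, 3, 4}
    B = {3, 4, 5}
    iou = len(A & B) / len(A | B)   # |A ∩ B| / |A ∪ B| = 2 / 5 = 0.4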

3.1 IoU Definition for Object Detection

For bounding-box-level object detection, the target object is usually represented by a minimum enclosing rectangle (Bbox) in the 2D image. Based on this representation, the IoU between the ground truth bounding box B_g and the predicted bounding box B_p is defined as

IoU(B_g, B_p) = Area(B_g ∩ B_p) / Area(B_g ∪ B_p). (2)

For 3D object detection, the Bbox is simply replaced by a cuboid, and the IoU value between two cuboids can be obtained by replacing the area with the volume in Eq. (2). For simplicity, we only consider the 2D case here; its extension to 3D is introduced in the following sections.

Figure 2: IoU computation for 2D: axis-aligned and rotated bounding boxes, where green and red represent the ground truth and predicted bounding boxes respectively. The intersection area is highlighted in gray.
1: Input - corners of the two bounding boxes: B_g = (x_1^g, y_1^g, x_2^g, y_2^g) and B_p = (x_1^p, y_1^p, x_2^p, y_2^p), where x_1 < x_2 and y_1 < y_2;
2: Output - IoU value;
3: The area of B_g: A_g = (x_2^g - x_1^g)(y_2^g - y_1^g);
4: The area of B_p: A_p = (x_2^p - x_1^p)(y_2^p - y_1^p);
5: The area of overlap: A_o = max(0, min(x_2^g, x_2^p) - max(x_1^g, x_1^p)) · max(0, min(y_2^g, y_2^p) - max(y_1^g, y_1^p));
6: IoU = A_o / (A_g + A_p - A_o);
Algorithm 1: IoU for two axis-aligned BBoxes.

3.2 Axis-aligned BBox

Usually, objects are labeled with axis-aligned BBoxes in most 2D object detection benchmarks, such as the Pascal Visual Object Classes (VOC) Challenge [5], COCO [14] and KITTI [7]. By taking these labels as ground truth, the predicted Bboxes are also axis-aligned rectangles. In this case, the IoU computation is very easy and can be implemented with basic math functions such as "max" and "min". The left of Fig. 2 illustrates an example of the intersection of two axis-aligned Bboxes, where the shaded area is the intersection. The pseudo-code of the IoU computation for the axis-aligned case is given in Alg. 1.
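The min/max formulation of Alg. 1 vectorizes directly; below is a batched NumPy sketch (our own illustration, not the paper's code) for boxes given as (x1, y1, x2, y2).

    import numpy as np

    def iou_axis_aligned_batch(boxes_a, boxes_b):
        """Element-wise IoU for two (N, 4) arrays of axis-aligned boxes (x1, y1, x2, y2)."""
        lt = np.maximum(boxes_a[:, :2], boxes_b[:, :2])      # top-left corner of the overlap
        rb = np.minimum(boxes_a[:, 2:], boxes_b[:, 2:])      # bottom-right corner of the overlap
        wh = np.clip(rb - lt, 0.0, None)                     # zero width/height if no overlap
        inter = wh[:, 0] * wh[:, 1]
        area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
        area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
        return inter / (area_a + area_b - inter)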

3.3 Rotated BBox

However, an axis-aligned box is not suitable for representing target objects in 3D, such as objects in a LiDAR point cloud. Usually, a 3D object is represented by a 3D cuboid. In the autonomous driving scenario, the general 3D BBox with three degrees of freedom for rotation can be reduced to one (e.g., the "yaw" angle) by assuming that all objects lie on a relatively flat road surface. This kind of representation is widely used in most popular 3D object detection benchmarks, such as KITTI [7] and nuScenes [1]. An example of a labeled 3D object in the KITTI data is given in Fig. 1.

For evaluating different methods, two different strategies are provided in KITTI: 2D Bbox overlap after projecting the 3D objects into the Bird-Eye-View (BEV), or 3D Bbox overlap directly. Here, we discuss the 2D case first; the 3D case is obtained from the 2D case by simply adding the height dimension. In the BEV image, objects are represented by rotated BBoxes, as shown at the bottom of Fig. 1. The IoU computation for two rotated rectangles is more complex than for axis-aligned ones because they can intersect in many different ways. A typical example of the intersection of two rotated rectangles is shown at the right of Fig. 2, where the overlap is highlighted. Obtaining the area of the overlap is the critical step of the IoU computation. The pseudo-code for the IoU computation for two rotated BBoxes is given in Alg. 2.

1: Input - corners of the two bounding boxes: B_g = {c_i^g} and B_p = {c_i^p}, i = 1, ..., 4, listed in order around each rectangle;
2: Output - IoU value;
3: Compute the area of B_g: A_g = ||u_g|| · ||v_g||, where u_g = c_2^g - c_1^g and v_g = c_3^g - c_2^g;
4: Compute the area of B_p: A_p = ||u_p|| · ||v_p||, where u_p = c_2^p - c_1^p and v_p = c_3^p - c_2^p;
5: Determine the vertexes of the overlap area, if any exist;
6: Sort these polygon vertexes in anticlockwise order;
7: Compute the intersection area A_o;
8: IoU = A_o / (A_g + A_p - A_o);
Algorithm 2: IoU for two rotated BBoxes.
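For a quick, non-differentiable reference implementation of Alg. 2, a general polygon-clipping library can be used; the sketch below relies on the third-party shapely package (our choice, not the paper's), whereas the proposed loss layer implements the clipping and sorting itself so that gradients can be propagated.

    from shapely.geometry import Polygon  # third-party library, used here only as a reference

    def rotated_iou_reference(corners_g, corners_p):
        """corners_*: four (x, y) corner tuples of a rotated rectangle, in order."""
        poly_g, poly_p = Polygon(corners_g), Polygon(corners_p)
        inter = poly_g.intersection(poly_p).area
        return inter / (poly_g.area + poly_p.area - inter)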

3.4 3D Bboxes

As mentioned before, a 3D object in autonomous driving is usually represented by a 3D Bbox with seven parameters: three for location, three for dimension and one for rotation. In this case, the IoU for two 3D Bboxes can be calculated as

IoU_3D(B_g, B_p) = (A_o · h_∩) / (V_g + V_p - A_o · h_∩), (3)

where A_o is the overlap area of the two Bboxes' bird-eye-view footprints computed as in Alg. 2, h_∩ is the overlap of the two Bboxes in the height direction, and V_g and V_p are the volumes of the ground truth and predicted Bboxes.
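A small sketch of Eq. (3), assuming the seven parameters are laid out as (x, y, z, w, l, h, yaw) with z the vertical center of the box (this layout is our assumption); only the BEV overlap area A_o, e.g., from Alg. 2, is needed as an additional input.

    def iou_3d(area_overlap_bev, box_g, box_p):
        """box_* = (x, y, z, w, l, h, yaw); area_overlap_bev is the BEV overlap area A_o."""
        zg1, zg2 = box_g[2] - box_g[5] / 2.0, box_g[2] + box_g[5] / 2.0
        zp1, zp2 = box_p[2] - box_p[5] / 2.0, box_p[2] + box_p[5] / 2.0
        h_inter = max(0.0, min(zg2, zp2) - max(zg1, zp1))     # overlap along the height axis
        vol_inter = area_overlap_bev * h_inter                # intersection volume
        vol_g = box_g[3] * box_g[4] * box_g[5]
        vol_p = box_p[3] * box_p[4] * box_p[5]
        return vol_inter / (vol_g + vol_p - vol_inter)        # Eq. (3)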

4 IoU Loss for 2D/3D BBox Regression

So far, we have introduced IoU as a metric for evaluating 2D and 3D BBoxes. Recently, some pioneering works have succeeded in integrating the IoU loss [27, 20] for BBox regression into popular 2D object detection frameworks [9, 10, 18]. Unfortunately, they can only handle axis-aligned BBoxes, and no work has been proposed to deal with the more general cases, such as rotated BBoxes or 3D object detection. As discussed in the previous section, the computation of the intersection of two rotated BBoxes is not trivial, and there is no off-the-shelf implementation in existing deep learning frameworks. To rectify this situation, we first investigate the IoU loss for two rotated BBoxes and then implement it as a unified loss layer for both 2D and 3D object detection frameworks.

4.1 IoU as Loss

In [27] and [20], the effectiveness of IoU as a loss function has been well demonstrated for the 2D axis-aligned BBox regression task. Theoretically, it should also work well for rotated BBoxes, because the only difference is that the computation process for rotated Bboxes is more complex than for axis-aligned ones. Similar to [20], we define the IoU loss as

L_IoU = 1 - IoU. (4)

Because IoU satisfies 0 ≤ IoU ≤ 1, L_IoU is also bounded between 0 and 1.
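A minimal sketch of Eq. (4) as a training loss, assuming a differentiable function iou_fn that returns per-pair IoU values for (possibly rotated) boxes; this is our illustration rather than the released layer.

    import torch

    def iou_loss(iou_fn, pred_boxes, gt_boxes):
        iou = iou_fn(pred_boxes, gt_boxes)   # per-pair IoU values in [0, 1]
        return (1.0 - iou).mean()            # Eq. (4); bounded in [0, 1] as well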

4.2 IoU Loss Layer

Currently, the IoU loss for two rotated Bboxes is not implemented in any deep learning framework. Therefore, we implement both the forward and backward operations of this IoU loss layer ourselves.

4.2.1 Forward

As described in Alg. 2, the forward process includes the following steps (a code sketch of steps 3-5 is given after the list):

  1. Compute the areas of B_p and B_g, where B_p and B_g represent the predicted and ground truth BBoxes respectively;

  2. Determine the vertexes of the intersection area between B_p and B_g. They come from two sources: the intersections of the two BBoxes' edges, and the corners of one BBox that lie inside the other BBox. The IoU value is zero if no such vertexes exist;

  3. Theoretically, these vertexes form a convex polygon. To compute its area, we need to sort the vertexes in anticlockwise (or clockwise) order. First, the center point of the vertexes is computed; then, the rotation angle of each vertex with respect to the center is calculated; finally, the vertexes are sorted by their rotation angles;

  4. The intersection area is then obtained by dividing the polygon into small individual triangles and summing their areas;

  5. Compute the IoU value via Eq. (2) and L_IoU via Eq. (4).
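The sorting and triangulation steps 3-5 can be written with standard differentiable tensor operations; the PyTorch sketch below assumes the intersection vertexes have already been collected into an (N, 2) tensor and is our own illustration, not the released layer.

    import torch

    def convex_polygon_area(vertices):
        """vertices: (N, 2) tensor of (unordered) intersection vertexes of a convex polygon."""
        center = vertices.mean(dim=0, keepdim=True)          # step 3: center of the vertexes
        d = vertices - center
        angles = torch.atan2(d[:, 1], d[:, 0])               # step 3: rotation angle of each vertex
        order = torch.argsort(angles)                        # step 3: anticlockwise ordering
        v = vertices[order]
        v_next = torch.roll(v, shifts=-1, dims=0)
        # step 4: triangle fan / shoelace formula, differentiable w.r.t. the vertexes
        cross = v[:, 0] * v_next[:, 1] - v[:, 1] * v_next[:, 0]
        return 0.5 * cross.sum().abs()

    def iou_from_areas(area_overlap, area_g, area_p):
        return area_overlap / (area_g + area_p - area_overlap)   # step 5: Eq. (2)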

4.2.2 Backward

Currently, the derivatives of common functions are implemented in most public deep learning frameworks, and the back-propagation process can be triggered automatically by calling these derivative functions. However, an analytical derivative of the whole IoU calculation is not easy to provide due to the complexity of the intersection of two rotated Bboxes. In particular, there are some custom operations (e.g., the intersection of two edges and the sorting of the vertexes) whose derivatives are not provided by existing deep learning frameworks. Therefore, we implement the backward operations for all these functions, and we will make the source code public in the future.
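Because the forward sketch above is composed of differentiable tensor operations (the sort only produces indices, which act as a fixed gather in the backward pass), autograd can propagate gradients from the intersection area back to the vertex coordinates. A small sanity check of this kind, reusing convex_polygon_area from the sketch in Sec. 4.2.1, might look as follows (our illustration).

    import torch

    # three intersection vertexes of a toy triangle; gradients are requested on the coordinates
    verts = torch.tensor([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]], requires_grad=True)
    area = convex_polygon_area(verts)    # area = 1.0 for this triangle
    area.backward()                      # fills verts.grad with d(area) / d(vertexes)
    print(area.item(), verts.grad)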

4.2.3 Extension to GIoU Loss

As a generalized version of IoU, GIoU has been proposed in [20] to handle the case in which two shapes have no intersection. GIoU provides a definition of the distance between two non-intersecting Bboxes. Generally speaking, for any two convex shapes A and B, a minimum enclosing shape C is defined as the smallest convex shape enclosing both A and B. Usually, C should share the same shape type as A and B for easy computation. Finally, GIoU is defined as

GIoU = IoU - (Area(C) - Area(A ∪ B)) / Area(C), (5)

where C is the minimum enclosing shape defined above. Similar to the IoU loss, we also extend the GIoU loss, L_GIoU = 1 - GIoU, to the case of rotated Bboxes.
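A reference (non-differentiable) sketch of Eq. (5) for rotated boxes, again using the third-party shapely package; the convex hull of the two boxes' corners is used as the enclosing shape C here purely for illustration, and the paper's exact choice of C for rotated Bboxes may differ.

    from shapely.geometry import MultiPoint, Polygon  # third-party library, reference only

    def rotated_giou_reference(corners_g, corners_p):
        poly_g, poly_p = Polygon(corners_g), Polygon(corners_p)
        inter = poly_g.intersection(poly_p).area
        union = poly_g.area + poly_p.area - inter
        iou = inter / union
        hull = MultiPoint(list(corners_g) + list(corners_p)).convex_hull  # enclosing shape C
        return iou - (hull.area - union) / hull.area                      # Eq. (5)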

5 Experimental Results

Loss Types | AP70 (Easy / Mod / Hard) | AP75 (Easy / Mod / Hard) | AP80 (Easy / Mod / Hard) | mAP (Easy / Mod / Hard)
SECOND [25] + smooth-L_1 | 88.15 78.33 77.25 | 81.37 66.86 65.48 | 59.56 48.90 44.45 | 62.41 57.52 56.23
SECOND + L_IoU | 89.16 78.99 77.78 | 83.40 73.36 66.72 | 66.36 52.60 50.61 | 64.55 58.96 57.61
Rel. improvement | 0.94% 0.82% 0.91% | 2.49% 9.72% 1.89% | 11.42% 7.57% 13.86% | 3.43% 2.50% 2.45%
SECOND + L_GIoU | 89.15 79.14 78.11 | 82.56 72.98 66.34 | 64.27 51.67 50.11 | 64.38 58.73 57.20
Rel. improvement | 1.13% 1.03% 1.11% | 1.46% 9.15% 1.31% | 7.91% 5.66% 12.73% | 3.16% 2.10% 1.72%
Table 1: Evaluation results by training SECOND [25] with the smooth-L_1 loss and the proposed losses on the validation set of the KITTI 3D car detection benchmark. All numbers are the higher the better. The best result of each column is highlighted in bold.
Loss Types | BEV AP70 (Easy / Mod / Hard) | mAP (Easy / Mod / Hard)
SECOND [25] + smooth-L_1 | 89.92 87.88 86.72 | 70.63 67.71 65.82
SECOND + L_IoU | 90.21 88.25 87.56 | 71.39 68.37 66.23
Rel. improvement | 0.32% 0.42% 0.97% | 1.07% 0.97% 0.62%
SECOND + L_GIoU | 90.25 88.51 87.65 | 73.35 68.48 66.92
Rel. improvement | 0.37% 0.72% 1.07% | 3.85% 1.14% 1.67%
Table 2: Evaluation results of SECOND [25] with the smooth-L_1 loss and the proposed IoU losses on the validation set of the KITTI BEV car detection benchmark. All numbers are the higher the better. The best result of each column is highlighted in bold.

The proposed loss layer is a framework-independent module which can be integrated into any regression-based 2D or 3D object detection method. Different from [20], the proposed loss layer is more general and handles both axis-aligned and non-axis-aligned cases, such as 2D BEV and 3D object detection. We integrate the proposed IoU/GIoU loss into different types of 3D object detection frameworks and then compare their performance on a public third-party 3D object detection benchmark.

Baselines: three state-of-the-art 3D object detectors are evaluated here: SECOND [25], PointPillars [12] and PointRCNN [21]. SECOND is a voxel-based one-stage object detector; it is an advanced version of VoxelNet [28] that adds sparse convolution operations implemented by the authors. PointPillars is an accelerated version of SECOND which represents the point cloud by pillars rather than voxels. First, PointNet [17] is employed to extract features for each pillar, and then the pillars are taken as the minimum elements for the subsequent convolutional network. Compared with SECOND, the pillar representation is much faster than the voxel representation. Different from the previous two methods, PointRCNN is a two-stage 3D object detector which combines point segmentation and region proposal generation in its first stage, while Bbox refinement is executed in the second stage of the framework.

Dataset: we train all the baselines and evaluate them on the KITTI [7] 3D object detection benchmark. The dataset is divided into a training and a testing subset, which consist of 7,481 and 7,518 frames respectively. Since the ground truth for the testing set is not available, we subdivide the training data into a training and a validation set as described in [28, 25]. Finally, we obtain 3,712 data samples for training and 3,769 data samples for validation. On the KITTI benchmark, objects are categorized into "easy", "moderate" and "hard" based on their height in the image, occlusion ratio, etc. For each frame, both the camera image and the LiDAR point cloud are provided, while only the point cloud is used for object detection here and the RGB image is used for visualization only.

Evaluation protocol: in this paper, we employ an evaluation metric similar to KITTI [7] to report all our results. In [7], all objects are divided into the "Easy", "Moderate" and "Hard" categories based on their distances, occlusion ratios, etc. For each category, we calculate the Average Precision (AP) at a given IoU matching threshold. Different from KITTI, we report three different thresholds here (0.7, 0.75 and 0.8). Besides this, we also give the mean Average Precision (mAP) averaged over a range of IoU thresholds to evaluate the performance of the detectors at different operating points.

Figure 3: An example of 3D car detection results with different methods, where the left column is from the original SECOND method and the right column is from SECOND with the proposed IoU loss.
Loss Types | AP70 (Easy / Mod / Hard) | AP75 (Easy / Mod / Hard) | AP80 (Easy / Mod / Hard) | mAP (Easy / Mod / Hard)
PointPillars [12] + smooth-L_1 | 87.29 76.99 70.84 | 72.39 62.73 56.40 | 47.23 40.89 36.31 | 58.62 54.86 52.74
PointPillars + L_IoU | 87.88 77.92 75.70 | 76.18 65.83 62.12 | 57.82 45.03 42.95 | 62.07 57.11 55.67
Rel. improvement | 0.68% 1.21% 6.86% | 5.24% 4.94% 10.14% | 22.4% 10.1% 18.28% | 5.89% 4.10% 5.56%
PointPillars + L_GIoU | 88.43 78.15 76.34 | 76.93 66.36 63.68 | 56.36 44.43 42.72 | 61.94 56.65 55.13
Rel. improvement | 1.34% 1.47% 7.62% | 6.27% 5.78% 12.9% | 19.3% 8.66% 17.65% | 5.53% 2.44% 4.17%
Table 3: Evaluation results by training PointPillars [12] with the smooth-L_1 loss and the proposed losses on the validation set of the KITTI 3D car detection benchmark. All numbers are the higher the better. The best result of each column is highlighted in bold.

5.1 SECOND [25]

Training protocol: the officially released code (https://github.com/traveller59/second.pytorch) by the authors is used for training the baseline SECOND model. We use exactly the same configuration file provided by the authors and follow the same training protocol to reproduce the baseline results on the KITTI benchmark. Compared with the baseline network, we simply replace the regression loss with our self-implemented L_IoU and L_GIoU losses. We use nearly the same training strategy as the baseline, e.g., the same iteration steps and learning rate. The only difference is that we decrease the IoU threshold at which an anchor is considered a positive sample during training from 0.6 to 0.5, which means that more positive anchors are involved in the training process. We keep the threshold at 0.6 in the baseline framework because it gives better results than 0.5 there.
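Schematically, the only change to the baseline training step is which regression loss is applied to the decoded boxes of the positive anchors; the snippet below is a hedged sketch of that substitution (the function and constant names are ours, not the API of the released SECOND code base).

    import torch

    POSITIVE_IOU_THRESHOLD = 0.5   # lowered from the 0.6 used for the baseline

    def regression_loss(pred_boxes, gt_boxes, iou_fn=None):
        """pred_boxes / gt_boxes: decoded (rotated) boxes of the matched positive anchors.
        If an IoU function is supplied the proposed loss is used; otherwise fall back to smooth-L1."""
        if iou_fn is not None:
            return (1.0 - iou_fn(pred_boxes, gt_boxes)).mean()            # proposed loss, Eq. (4)
        return torch.nn.functional.smooth_l1_loss(pred_boxes, gt_boxes)   # baseline loss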

Results: the comparison of the IoU losses with the original SECOND method for 3D car detection on the KITTI benchmark is given in Tab. 1. On this benchmark, a matching IoU threshold of 0.7 is used for the official evaluation; however, as mentioned in [22], such hyper-parameters (e.g., the matching IoU threshold) usually have a big influence on the reported performance of detectors if only a single matching IoU threshold is used. Therefore, three different matching thresholds and the mAP are used for evaluation here.

From this table, we can see that the proposed L_IoU and L_GIoU give slightly better results than the baseline for all three categories ("easy", "moderate" and "hard") at the IoU matching threshold of 0.7. Compared with the baseline, around 1% relative improvement is obtained by L_IoU and L_GIoU for the three categories. At this threshold, L_GIoU performs slightly better than L_IoU.

We also find an interesting phenomenon: the L_IoU and L_GIoU losses give much larger improvements over the baseline when the IoU matching threshold is set to a higher value. We can see clearly that the improvements at a threshold of 0.8 are much greater than at 0.7. At 0.8, the relative improvements for the three categories reach 11.42%, 7.57% and 13.86% respectively with the L_IoU loss. At this threshold, the improvements for the L_GIoU loss reach 7.91%, 5.66% and 12.73%, slightly worse than L_IoU.

The mAPs of all methods are given in the last columns of the table. We can also easily see that the detection performance is steadily improved by the L_IoU and L_GIoU losses. With the new losses, all detection rates show an average improvement of about 2%, and the improvement reaches 3% for some specific categories.

Loss Types | BEV AP70 (Easy / Mod / Hard) | mAP (Easy / Mod / Hard)
PointPillars [12] + smooth-L_1 | 90.07 87.06 83.81 | 69.11 66.84 65.36
PointPillars + L_IoU | 90.24 88.02 86.64 | 71.33 68.11 66.53
Rel. improvement | 0.19% 1.10% 3.38% | 3.21% 1.90% 1.79%
PointPillars + L_GIoU | 90.35 88.26 87.04 | 71.74 68.04 66.63
Rel. improvement | 0.31% 1.37% 3.85% | 3.81% 1.80% 1.94%
Table 4: Evaluation results by training PointPillars [12] with the smooth-L_1 loss and the proposed losses on the KITTI validation set for the BEV image. The best result of each column is highlighted in bold.

The detection results on the BEV image are given in Tab. 2. Compared with the baseline, we can also see that the detection rate is slightly improved by the proposed IoU loss for all three categories. An example of detection results in the BEV image and the point cloud is given in Fig. 3. The bottom of this figure shows the 2D detections in the BEV image, where the number next to each Bbox is its 3D IoU value. We can see that most of the values on the right are larger than those on the left, which means that the bounding box accuracy is consistently improved by the proposed IoU loss.

5.2 PointPillars [12]

Training protocol: the officially released code (https://github.com/nutonomy/second.pytorch) by the authors is used for training the PointPillars baseline model. We reproduce the baseline results on the KITTI benchmark following the official configuration file and training protocol. Similar to the SECOND experiments, we replace the regression loss with our proposed L_IoU and L_GIoU losses and decrease the foreground threshold from 0.6 to 0.5.

Results: the comparison of the proposed losses with the original PointPillars is given in Tab. 3, using the same evaluation criteria as before. From Tab. 3, the benefit of the proposed losses is demonstrated clearly. For AP70, the detection rates are improved by around 1% for the "easy" and "moderate" categories and by 6.86% and 7.62% for the "hard" category with the L_IoU and L_GIoU losses respectively. For AP80, the proposed L_IoU loss achieves a significant improvement of 22.4%, 10.1% and 18.28% on the three categories, while the proposed L_GIoU, which performs slightly worse but is still encouraging, improves the baseline by 19.3%, 8.66% and 17.65% respectively.

The mAP values are shown in the last columns; they are steadily improved by roughly 4% for both the L_IoU and L_GIoU losses compared with the baseline, and the improvement reaches 5% for some specific categories. The detection results on the BEV images are shown in Tab. 4. From this table, we can also see steady improvement with the proposed L_IoU and L_GIoU losses.

5.3 PointRCNN [21]

Training protocol: different from the previous two methods, PointRCNN is a two-stage framework. In the first stage, an RPN network is employed to generate region proposals, and Bbox refinement is executed in the second stage to obtain the final results. We train the baseline with the officially released source code (https://github.com/sshaoshuai/PointRCNN). Currently, we keep the RPN part unchanged and integrate the proposed IoU loss only into the second stage. The loss function of this stage includes a class classification loss, a bin classification loss and the regression loss. Here, we keep the first two parts unchanged and replace the regression loss with the proposed L_IoU and L_GIoU. To be clear, with the officially released code we could not reproduce the results reported in their paper; therefore, the baseline reported here is the best model that we could achieve. In particular, we train the baseline and the models with the proposed losses using the same training strategy for a fair comparison.

Loss Types | AP70 (Easy / Mod / Hard) | AP75 (Easy / Mod / Hard) | AP80 (Easy / Mod / Hard) | mAP (Easy / Mod / Hard)
PointRCNN [21] | 88.14 77.58 75.36 | 73.27 63.54 61.08 | 44.21 38.88 34.62 | 59.44 54.35 52.79
PointRCNN + L_IoU | 88.83 78.80 78.18 | 77.42 67.83 66.85 | 58.22 49.09 45.38 | 63.47 57.71 56.67
Rel. improvement | 0.78% 1.57% 3.74% | 5.66% 6.75% 9.44% | 31.6% 26.26% 31.08% | 6.78% 6.18% 7.35%
PointRCNN + L_GIoU | 88.84 78.85 78.15 | 77.47 67.98 67.18 | 59.80 51.25 46.50 | 63.12 57.96 56.92
Rel. improvement | 0.79% 1.64% 3.70% | 5.73% 6.99% 9.99% | 35.3% 31.81% 34.31% | 6.19% 6.64% 7.82%
Table 5: Evaluation results by training PointRCNN [21] with its original regression loss and the proposed losses on the KITTI validation dataset. The best result of each column is highlighted in bold.

Results: the comparison of PointRCNN with different losses is given in Tab. 5. Similar to the previous methods, we can easily see that both L_IoU and L_GIoU improve the baseline's performance at different IoU thresholds for all categories. In particular, the detection rates are improved by a big margin at the higher IoU thresholds. Furthermore, for the mAP criterion, both L_IoU and L_GIoU also give a big improvement compared with the original PointRCNN. Based on this experiment, we can conclude that the proposed IoU loss also works for two-stage methods.

5.4 Comparison with Other Methods

Methods | Modality | AP70 (Easy / Mod / Hard)
MV3D [3] | LiDAR+Mono | 71.29 62.68 56.56
F-PointNet [16] | LiDAR+Mono | 83.76 70.92 63.65
AVOD-FPN [11] | LiDAR+Mono | 84.41 74.44 68.65
ContFusion [13] | LiDAR+Mono | 86.33 73.25 67.81
IPOD [26] | LiDAR+Mono | 84.10 76.40 75.30
F-ConvNet [23] | LiDAR+Mono | 89.02 78.80 77.09
VoxelNet [28] | LiDAR | 81.97 65.46 62.85
PointPillars [12] | LiDAR | 87.29 76.99 70.84
PointRCNN [21] | LiDAR | 88.88 78.63 77.38
SECOND [25] | LiDAR | 88.15 78.33 77.25
SECOND + L_IoU | LiDAR | 89.16 78.99 77.78
SECOND + L_GIoU | LiDAR | 89.15 79.14 78.11
Table 6: Comparison with other published methods on the KITTI validation dataset for 3D "Car" detection. For easy understanding, the top two numbers in each column are highlighted in bold and italic. All numbers are the higher the better.

In the subsections above, we compared the implemented new losses with the original distance-based losses on different baselines. In this subsection, we compare the improved baseline with state-of-the-art 3D object detection methods on both the val split and the test split of the KITTI [7] 3D object detection benchmark. First of all, Tab. 6 gives the comparison results on the validation dataset. We list nearly all the top published results here, including multi-modality fusion-based [3, 16, 11, 13, 26, 23], one-stage [25, 28, 12] and two-stage [21] approaches. Among all the methods, the improved baseline with L_IoU and L_GIoU achieves the best results for all three categories and even performs much better than the fusion-based and two-stage methods.

Methods | Modality | AP70 (Easy / Mod / Hard)
MV3D [3] | LiDAR+Mono | 71.09 62.35 55.12
F-PointNet [16] | LiDAR+Mono | 81.20 70.29 62.19
AVOD-FPN [11] | LiDAR+Mono | 81.94 71.88 66.38
ContFusion [13] | LiDAR+Mono | 82.54 66.22 64.04
IPOD [26] | LiDAR+Mono | 79.75 72.57 66.33
F-ConvNet [23] | LiDAR+Mono | 85.88 76.51 68.08
VoxelNet [28] | LiDAR | 77.47 65.11 57.73
PointPillars [12] | LiDAR | 79.05 74.99 68.30
PointRCNN [21] | LiDAR | 85.94 75.76 68.32
SECOND [25] | LiDAR | 84.04 75.38 67.36
SECOND + proposed loss | LiDAR | 84.43 76.28 68.22
Table 7: Comparison with other published methods on the KITTI testing dataset for 3D "Car" detection. For easy understanding, the top two numbers in each column are highlighted in bold and italic. All numbers are the higher the better.

Tab. 7 gives the evaluation results on the KITTI testing benchmark. We obtained the results on the testing split by submitting our detections to KITTI's online evaluation server, while the results of the other methods are taken from their respective publications. One important point is that the model submitted to the test server is trained with the same half-half split used for the validation experiments, rather than with a bigger training split (as, e.g., in [12]). From the table, we can see that the proposed loss improves the performance of the baseline [25] for all three categories. Especially for the "moderate" and "hard" categories, the improvement nearly reaches one point. Furthermore, for the "moderate" and "hard" categories, the proposed loss achieves comparable or even better results than the state-of-the-art sensor-fusion-based [23] and two-stage [21] methods.

6 Conclusion and Future Works

In this paper, we have addressed the 2D/3D object detection problem by introducing an IoU loss for two rotated Bboxes. We proposed a unified, framework-independent IoU loss layer which can be directly applied to axis-aligned or rotated 2D/3D object detection frameworks. By integrating this IoU loss layer into several state-of-the-art 3D object detectors, consistent improvements have been achieved for both 2D detection in the bird-eye-view and 3D object detection in the point cloud. In particular, the proposed IoU loss performs much better when the IoU matching threshold is set to a high value. In the future, we would like to extend the current IoU loss layer to more general 3D object detection cases, e.g., Bboxes with three orientation parameters.

References

  • [1] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
  • [2] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2156.
  • [3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915.
  • [4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov (2014) Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2154.
  • [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  • [6] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
  • [7] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
  • [9] R. Girshick (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  • [11] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8.
  • [12] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2018) PointPillars: fast encoders for object detection from point clouds. arXiv preprint arXiv:1812.05784.
  • [13] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656.
  • [14] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • [15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • [16] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927.
  • [17] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
  • [18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • [20] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. arXiv preprint arXiv:1902.09630.
  • [21] S. Shi, X. Wang, and H. Li (2019) PointRCNN: 3d object proposal generation and detection from point cloud. In CVPR.
  • [22] L. Tychsen-Smith and L. Petersson (2018) Improving object localization with fitness nms and bounded iou loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6877–6885.
  • [23] Z. Wang and K. Jia (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. arXiv preprint arXiv:1903.01864.
  • [24] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) SqueezeSeg: convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893.
  • [25] Y. Yan, Y. Mao, and B. Li (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
  • [26] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2018) IPOD: intensive point-based object detector for point cloud. arXiv preprint arXiv:1812.05276.
  • [27] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang (2016) UnitBox: an advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 516–520.
  • [28] Y. Zhou and O. Tuzel (2018) VoxelNet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.