adopted the pipeline similar to 2D detectors and mainly focused on RGB features extracted from 2D images. However, these features are not suitable for 3D-related tasks because they lack spatial information. This is one of the main reasons why early studies failed to achieve better performance. An intuitive solution is to use a CNN to predict depth maps and use them as input when depth data is not available. Although depth information is helpful for 3D scene understanding, simply using it as an additional channel of RGB images, as in [31, 18], does not compensate for the performance difference between image-based and LiDAR-based methods. There is no doubt that LiDAR data is much more accurate than estimated depth; here we argue that the performance gap is due not only to the accuracy of the data but also to its representation (see Fig. 1 for different input representations on the monocular 3D detection task). To narrow the gap and give the estimated depth a bigger role, we need a more explicit representation such as a point cloud, which describes real-world 3D coordinates, rather than a depth map, which only encodes relative positions in the image. For example, objects at different positions in the 3D world may have the same coordinates in the image plane, which makes it difficult for the network to estimate the final results. The benefits of transforming the depth map into a point cloud are as follows: (1) Point cloud data shows the spatial information explicitly, which makes it easier for the network to learn the non-linear mapping from input to output. (2) Richer features can be learned by the network because some specific spatial structures exist only in 3D space. (3) The recent significant progress of deep learning on point clouds provides a solid building block, with which we can estimate 3D detection results more effectively and efficiently.
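The ambiguity of image-plane coordinates can be illustrated with a small pinhole-projection sketch. The intrinsics and the two points below are hypothetical values chosen only for illustration; they are not taken from KITTI.

```python
def project(point, f, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


# Hypothetical intrinsics (focal length and principal point), for illustration only.
F, CX, CY = 700.0, 600.0, 180.0

near = (1.0, 0.5, 10.0)  # a point 10 m in front of the camera
far = (2.0, 1.0, 20.0)   # a point 20 m away on the same viewing ray

pixel_near = project(near, F, CX, CY)
pixel_far = project(far, F, CX, CY)

# Both points land on the same pixel; only the depth tells them apart.
print(pixel_near, pixel_far)
```

Without depth, the two points are indistinguishable in the image, which is why an explicit 3D representation removes this ambiguity.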
Based on the observations above, we propose a monocular 3D object detection framework. The main idea behind its design is to find a better input representation. Specifically, we first apply front-end deep CNNs to the input RGB data to perform two intermediate tasks, 2D detection and depth estimation (see Fig. 2). Then, we transform the depth maps into point clouds with the help of camera calibration files, so that the 3D information is given explicitly, and use them as input data for the subsequent steps. Another crucial component that ensures the performance of the proposed method is the multi-modal feature fusion module. After aggregating RGB information, which is complementary to the 3D point clouds, the discriminative capability of the features used to describe 3D objects is further enhanced. Note that once all networks have been optimized, the inference phase relies only on the RGB input.
The contributions of this paper can be summarized as:
We propose a new framework for monocular 3D object detection which transforms the 2D image into a 3D point cloud and performs 3D detection effectively and efficiently.
We design a feature fusion strategy that fully exploits the advantages of the RGB cue and the point cloud to boost detection performance, and which can also be applied in other scenarios such as LiDAR-based 3D detection.
Evaluation on the challenging KITTI dataset shows that our method outperforms all state-of-the-art monocular methods by around 15% and 11% AP on the 3D localization and detection tasks, respectively.
2 Related Work
We briefly review existing work on LiDAR-based and image-based 3D object detection in autonomous driving scenarios.
Image-based 3D Object Detection: Early monocular methods share a similar framework with 2D detection, but estimating the 3D coordinates (x, y, z) of the object center is much more complicated, since image appearance alone cannot determine the absolute physical location. Mono3D and 3DOP focus on generating 3D object proposals using prior knowledge (e.g., object size, ground plane) from monocular and stereo images, respectively. Deep3DBox introduces geometric constraints based on the fact that the 3D bounding box should fit tightly into the 2D detection bounding box. Deep MANTA encodes 3D vehicle information using key points, since vehicles are rigid objects with well-known geometry. Vehicle recognition in Deep MANTA can then be treated as an additional key point detection task.
Although these methods propose some effective prior knowledge or reasonable constraints, they fail to achieve promising performance because of the lack of spatial information. Another recently proposed method for monocular 3D object detection introduces a multi-level fusion scheme that uses a stand-alone module to estimate disparity and fuses it with RGB information in the input data encoding, 2D box estimation and 3D box estimation phases. Although it uses depth (or disparity) several times, it only regards it as auxiliary information for the RGB features and does not make full use of its potential value. In comparison, our method takes the generated depth as the core feature and processes it in 3D space, which expresses spatial information explicitly.
LiDAR-based 3D Object Detection: Although our approach takes monocular image data as input, we transform the data representation into a point cloud, the same representation used by LiDAR-based methods. We therefore also introduce some typical LiDAR-based approaches. MV3D encodes 3D point clouds with multi-view feature maps, enabling region-based representation for multimodal fusion. With the development of deep learning on raw point clouds [23, 24, 12], several detection approaches based only on raw LiDAR data have also been proposed. Qi et al. extend PointNet to the 3D detection task by extracting the frustum point clouds corresponding to their 2D detections. VoxelNet divides point clouds into equally spaced 3D voxels and transforms the group of points within each voxel into a unified feature representation. Finally, 2D convolution layers are applied to these high-level voxel-wise features to obtain spatial features and produce predictions. Although these two methods achieve promising detection results, they do not make good use of RGB information. In comparison, we introduce an RGB feature fusion module to enhance the discriminative capability of point clouds.
3 Proposed Method
In this section, we describe the proposed framework for monocular 3D object detection. We first present an overview of the proposed method and then introduce its details. Finally, we give the optimization and implementation details for the overall network.
3.1 Approach Overview
As shown in Fig. 2, the proposed 3D detection framework consists of two main stages. In the 3D data generation phase, we train two deep CNNs for the intermediate tasks (2D detection and depth estimation) to obtain position and depth information. In particular, we transform the generated depth into a point cloud, which is a better representation for 3D detection, and then use the 2D bounding box as prior information about the location of the RoI (region of interest). Finally, we extract the points in each RoI as input data for the subsequent steps. In the 3D box estimation phase, to improve the final task, we design two modules for background point segmentation and RGB information aggregation, respectively. After that, we use PointNet as our backbone network to predict the 3D location, dimensions and orientation for each RoI. Note that the confidence scores of the 2D boxes are assigned to their corresponding 3D boxes.
3.2 3D Data Generation
Intermediate tasks. 3D detection using only monocular images is a very challenging task because image appearance cannot determine the 3D coordinates of an object. Therefore, we train two deep CNNs to generate a depth map and 2D bounding boxes, which provide spatial information and a position prior. We adopt existing algorithms for these intermediate tasks and give a detailed analysis of their impact on overall performance in the experimental part.
Input representation. This work focuses more on how to use depth information than on how to obtain it. We believe that one of the main reasons why previous image-based 3D detectors fail to achieve better results is that they do not make good use of depth maps. Simply using the depth map as an additional channel of the RGB image, as in [31, 18], and then expecting the neural network to extract effective features automatically is not the best solution. In contrast, we transform the estimated depth into a point cloud with the help of the camera calibration file provided by KITTI (see Fig. 1 for different input representations) and use it as our input data. Specifically, given a pixel coordinate $(u, v)$ with depth $d$ in the 2D image space, the 3D coordinates $(x, y, z)$ in the camera coordinate system can be computed as:

$$ z = d, \qquad x = \frac{(u - C_x)\, z}{f}, \qquad y = \frac{(v - C_y)\, z}{f}, \tag{1} $$

where $f$ is the focal length of the camera and $(C_x, C_y)$ is the principal point. The input point cloud $S$ can be generated from the depth map and the 2D bounding box $B$ as follows:

$$ S = \{\, p \mid p \leftarrow F(u, v, d),\ (u, v) \in B \,\}, \tag{2} $$

where $(u, v)$ is a pixel in the depth map with depth $d$ and $F$ is the transforming function introduced by Eq. 1. It should be noted that, like most monocular methods, we use the camera calibration file in our approach. Alternatively, we can use a point cloud encoder-decoder network to learn a mapping from $(u, v, d)$ to $(x, y, z)$, so that the camera calibration is no longer needed during the testing phase. In our measurements, we observe no visible performance difference between these two methods. This is because the error introduced in the point cloud generation phase is much smaller than the noise contained in the depth map itself.
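The back-projection step described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `depth[v][u]` list-of-rows layout, the half-open box convention and the skipping of non-positive depth values are all hypothetical choices.

```python
def backproject(u, v, d, f, cx, cy):
    """Map a pixel (u, v) with depth d to camera-frame coordinates (x, y, z)."""
    z = d
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return (x, y, z)


def roi_point_cloud(depth, box, f, cx, cy):
    """Back-project every pixel inside a 2D box.

    depth: list of rows, indexed as depth[v][u] (assumed layout).
    box:   (u_min, v_min, u_max, v_max), max values exclusive (assumed convention).
    """
    u_min, v_min, u_max, v_max = box
    points = []
    for v in range(v_min, v_max):
        for u in range(u_min, u_max):
            d = depth[v][u]
            if d > 0:  # assumption: non-positive depth marks an invalid pixel
                points.append(backproject(u, v, d, f, cx, cy))
    return points


# Toy example: a 4x4 depth map where every pixel is 10 m away.
toy_depth = [[10.0] * 4 for _ in range(4)]
toy_points = roi_point_cloud(toy_depth, (1, 1, 3, 3), f=100.0, cx=2.0, cy=2.0)
```

In practice the intrinsics would come from the calibration file, and a vectorized array library would replace the nested loops.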
3.3 3D Box Estimation
Point segmentation. After the 3D data generation phase, the input data is encoded as a point cloud. However, this data contains many background points, which should be discarded in order to estimate the position of the target accurately. Qi et al. propose a 3D instance segmentation PointNet to solve this problem for LiDAR data, but that strategy requires additional pre-processing to generate segmentation labels from the 3D object ground truth. More importantly, even with the same labelling method the labels would be severely noisy, because the points we reconstruct are relatively unstable. For these reasons, we propose a simple but effective segmentation method based on a depth prior. Specifically, we first compute the mean depth in each 2D bounding box to obtain the approximate position of the RoI and use it as the threshold. All points with a Z-channel value greater than this threshold are considered background points. The processed point set $S'$, obtained from the input point set $S$, can be expressed as:

$$ S' = \Big\{\, p \;\Big|\; p_z \le \frac{1}{|S|} \sum_{p' \in S} p'_z + r,\ p \in S \,\Big\}, \tag{3} $$

where $p_z$ denotes the Z-channel value (which is equal to the depth) of point $p$ and $r$ is a bias used to correct the threshold. Finally, we randomly select a fixed number of points from $S'$ as the output of this module, so that the number of input points for the subsequent network is consistent.
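The depth-prior segmentation and fixed-size sampling can be sketched as below. The function names are hypothetical, and the handling of RoIs with fewer points than requested (padding by sampling with replacement) is an assumption, since the text does not specify it.

```python
import random


def segment_by_depth_prior(points, bias):
    """Keep points whose depth (z) is at most the mean RoI depth plus a bias."""
    if not points:
        return []
    mean_z = sum(p[2] for p in points) / len(points)
    threshold = mean_z + bias
    return [p for p in points if p[2] <= threshold]


def sample_fixed(points, n, rng):
    """Return exactly n points drawn at random from a non-empty point list."""
    if len(points) >= n:
        return rng.sample(points, n)
    # Assumption: too few points are padded by sampling with replacement.
    return points + [rng.choice(points) for _ in range(n - len(points))]


roi = [(0.0, 0.0, 10.0), (0.1, 0.0, 11.0), (0.2, 0.0, 12.0), (0.3, 0.0, 40.0)]
foreground = segment_by_depth_prior(roi, bias=1.0)  # mean depth is 18.25
fixed = sample_fixed(foreground, 8, random.Random(0))
```

Here the far point at 40 m lies beyond the threshold of 19.25 m and is treated as background, while the three nearby points are kept and resampled to a fixed size.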
3D box estimation. Before we estimate the final 3D results, we follow  to predict the center $\delta$ of the RoI using a lightweight network and use it to update the point cloud as follows:

$$ S'' = \{\, p - \delta \mid p \in S' \,\}, \tag{4} $$

where $S'$ is the segmented point set and $S''$ is the set of points used for the final task. Then, we choose PointNet as our 3D detection backbone network to estimate the 3D object, which is encoded by its center $(x, y, z)$, size $(h, w, l)$ and heading angle $\theta$. As in other works, we only consider one orientation angle, under the assumption that the road surface is flat and the other two angles do not vary. Note also that the center estimated here is a 'residual' center, which means the real center is $(x + \delta_x, y + \delta_y, z + \delta_z)$. Finally, we assign the confidence scores of the 2D bounding boxes to their corresponding 3D detection results.
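The re-centering of the point cloud and the recovery of the real center from the residual prediction can be sketched as follows. Both helper names are hypothetical, and the numeric values are made up for illustration.

```python
def recenter(points, delta):
    """Shift every point by the coarse center delta predicted by the lightweight net."""
    return [tuple(c - d for c, d in zip(p, delta)) for p in points]


def absolute_center(residual, delta):
    """Recover the real box center from the residual center and the coarse center."""
    return tuple(r + d for r, d in zip(residual, delta))


coarse = (2.0, 3.0, 11.0)                       # hypothetical coarse center
shifted = recenter([(1.0, 2.0, 10.0), (3.0, 4.0, 12.0)], coarse)
center = absolute_center((0.5, -0.5, 1.0), coarse)
```

Working in the shifted frame keeps the regression targets of the detection network small, and the final center is obtained by adding the coarse center back.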
3.4 RGB Information Aggregation
To further improve the performance and robustness of our method, we propose to aggregate complementary RGB information into the point cloud. Specifically, we add RGB information to the generated point cloud by replacing Eq. 2 with:

$$ S = \{\, p \mid p \leftarrow [F(u, v, d),\ D(u, v)],\ (u, v) \in B \,\}, \tag{5} $$

where $F$ is the transforming function of Eq. 1 and $D$ is a function which outputs the RGB values corresponding to the input point. In this way, the points are encoded as 6D vectors: $[x, y, z, r, g, b]$. However, relying on this simple scheme alone (we call it 'plain concat' in the experimental part) to add RGB information does not work well. So, as shown in Fig. 3, we introduce an attention mechanism for the fusion task. The attention mechanism has been successfully applied in various tasks, such as image caption generation and machine translation, for selecting useful information. Specifically, we use the attention mechanism to guide the message passing between the spatial features and the RGB features. Since the passed information is not always useful, the attention acts as a gate function that controls the flow; in other words, it lets the network automatically learn to focus on or ignore information from the other features. When we pass the RGB message to its corresponding point, an attention map $G$ is first produced from the feature maps $F_{xyz}$ generated by the XYZ branch as follows:

$$ G \leftarrow \sigma\big(f(F_{xyz})\big), \tag{6} $$

where $f$ is the nonlinear function learned by a convolution layer and $\sigma$ is a sigmoid function for normalizing the attention map. Then the message is passed under the control of the attention map as follows:

$$ F_{xyz} \leftarrow F_{xyz} + G \odot F_{rgb}, \tag{7} $$

where $\odot$ denotes element-wise multiplication. In addition to point-level feature fusion, we also introduce another branch to provide object-level RGB information. In particular, we first crop the RoI from the RGB image and resize it to 128×128. Then we use a CNN to extract the object-level feature maps $F_{obj}$, and the final set of feature maps obtained from the fusion module is $F \leftarrow \mathrm{CONCAT}(F_{xyz}, F_{obj})$, where CONCAT denotes the concatenation operation.
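The gating idea behind the point-level fusion can be sketched in plain Python as below. This is only an illustration under strong assumptions: the learned convolution layer is replaced by a single linear unit with hypothetical weights, the gate is one scalar per point, and the gated RGB feature is added to the XYZ feature. None of these details are confirmed by the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attention_gate(xyz_feat, weights, bias):
    """Stand-in for the learned layer: one linear unit on the XYZ feature, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, xyz_feat)) + bias)


def fuse(xyz_feats, rgb_feats, weights, bias):
    """Pass the RGB message to each point, scaled by a gate computed from XYZ features."""
    fused = []
    for fx, fr in zip(xyz_feats, rgb_feats):
        g = attention_gate(fx, weights, bias)
        fused.append([a + g * b for a, b in zip(fx, fr)])
    return fused


# With zero weights and zero bias the gate is 0.5 for every point.
half_open = fuse([[1.0, 2.0]], [[4.0, 6.0]], weights=[0.0, 0.0], bias=0.0)
# A strongly negative bias closes the gate, so the RGB message is ignored.
closed = fuse([[1.0, 2.0]], [[4.0, 6.0]], weights=[0.0, 0.0], bias=-50.0)
```

A gate close to zero leaves the spatial feature untouched, which is how the network can learn to ignore unhelpful RGB information, while a gate close to one passes the message through in full.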
3.5 Implementation Details
The whole training process consists of two phases. In the first phase, we only optimize the intermediate networks, following the training strategies of the original papers. After that, we jointly optimize the two networks for 3D detection with a multi-task loss function:

$$ L = L_{loc} + L_{det} + \lambda L_{corner}, \tag{8} $$

where $L_{loc}$ is the loss function for the lightweight location net (center only) and $L_{det}$ is for the 3D detection net (center, size and heading angle). We also use the corner loss $L_{corner}$, weighted by $\lambda$, where the output targets are first decoded into oriented 3D boxes and then a smooth L1 loss is computed directly on the (x, y, z) coordinates of the eight box corners with respect to the ground truth. We train the networks for 200 epochs using the Adam optimizer with a batch size of 32. The learning rate is initially set to 0.001 and halved every 20 epochs. The whole training process can be completed in one day.
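The corner loss described above can be sketched as follows. This is a simplified illustration, not the authors' code: the corner ordering, the rotation about the vertical (Y) axis of the camera frame, the (h, w, l) size order and the averaging over all 24 coordinates are assumptions, and the heading-flip handling used in some implementations is omitted.

```python
import math


def box_corners(center, size, heading):
    """Decode (center, size, heading) into the eight corners of an oriented 3D box."""
    cx, cy, cz = center
    h, w, l = size
    c, s = math.cos(heading), math.sin(heading)
    corners = []
    for dx in (l / 2, -l / 2):
        for dy in (h / 2, -h / 2):
            for dz in (w / 2, -w / 2):
                # Rotate the offset about the Y axis, then translate to the center.
                x = c * dx + s * dz + cx
                y = dy + cy
                z = -s * dx + c * dz + cz
                corners.append((x, y, z))
    return corners


def smooth_l1(x):
    """Smooth L1 (Huber with threshold 1) of a scalar difference."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def corner_loss(pred, target):
    """Mean smooth L1 distance over the coordinates of corresponding corners."""
    pc = box_corners(*pred)
    tc = box_corners(*target)
    total = sum(smooth_l1(a - b) for p, t in zip(pc, tc) for a, b in zip(p, t))
    return total / (len(pc) * 3)


gt_box = ((0.0, 1.0, 20.0), (1.5, 1.6, 4.0), 0.0)
shifted_box = ((2.0, 1.0, 20.0), (1.5, 1.6, 4.0), 0.0)
loss_same = corner_loss(gt_box, gt_box)
loss_shifted = corner_loss(shifted_box, gt_box)
```

Because the loss is computed on decoded corners, errors in center, size and heading are all penalized through one geometric quantity rather than separately.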
The proposed method is implemented in PyTorch and runs on Nvidia 1080Ti GPUs. The two intermediate networks of the proposed method naturally support any network structure. We implement several different methods exactly as described in their papers, and the relevant analysis can be found in the experimental part. For the 3D detection networks, we use PointNet as the backbone and train it from scratch with random initialization. Moreover, dropout with a keep rate of 0.7 is applied to every fully connected layer except the last one. For the RGB values, we first normalize their range to (0, 1) by dividing by 255, and then standardize the distribution of each color channel to a standard normal distribution. For the region branch in the RGB feature fusion module, we use ResNet-34 with half the channels and global pooling to obtain the 1×1×256 features.
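The RGB preprocessing can be sketched as below. Where the per-channel statistics are computed is an assumption: this sketch estimates them from the points passed in, whereas a real pipeline might use fixed dataset-level statistics. The function name and the small `eps` guard are hypothetical.

```python
import statistics


def normalize_rgb(colors, eps=1e-8):
    """Scale 8-bit RGB values by 1/255, then standardize each color channel."""
    scaled = [[c / 255.0 for c in rgb] for rgb in colors]
    channels = list(zip(*scaled))
    means = [statistics.fmean(ch) for ch in channels]
    stds = [statistics.pstdev(ch) for ch in channels]
    # eps avoids division by zero when a channel is constant.
    return [
        [(v - m) / (s + eps) for v, m, s in zip(rgb, means, stds)]
        for rgb in scaled
    ]


normalized = normalize_rgb([(0, 0, 0), (255, 255, 255)])
constant = normalize_rgb([(128, 10, 10), (128, 20, 30)])
```

After this step each channel has approximately zero mean and unit variance, which puts the color values on a scale comparable to the re-centered coordinates.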
4 Experimental Results
We evaluate our approach on the challenging KITTI dataset, which provides 7,481 images for training and 7,518 images for testing. Detection and localization tasks are evaluated in three regimes, easy, moderate and hard, according to the occlusion and truncation levels of the objects. Since the ground truth for the test set is not available and access to the test server is limited, we conduct a comprehensive evaluation using the protocol described in [3, 4, 5] and subdivide the training data into a training set and a validation set, which results in 3,712 samples for training and 3,769 samples for validation. The split avoids samples from the same sequence being included in both the training and the validation set.
4.1 Comparing with other methods
Baselines. As this work aims at monocular 3D object detection, our approach is mainly compared to other methods that take only monocular images as input. Five methods are chosen for comparison: Mono3D, Deep3DBox, Multi-Fusion, ROI-10D and MonoGRNet.
Car. The evaluation results of the 3D localization and detection tasks on the KITTI validation set are presented in Tables 1 and 2, respectively. The proposed method consistently outperforms all competing approaches across all three difficulty levels. For the localization task, the proposed method outperforms the state-of-the-art Multi-Fusion by 15% in the moderate setting. For the 3D detection task, our method achieves 12.2% and 10.9% improvement (moderate) over the recently proposed MonoGRNet under IoU thresholds of 0.5 and 0.7. In the easy setting, our improvement is more prominent: our method achieves 21.7% and 18.4% improvement over the previous state of the art on the localization and detection tasks (IoU=0.7). In addition, Table 3 shows the results on the testing set; the anonymous submission can be found on the official KITTI server. The testing set results also show that our method outperforms the others. Note that our method uses no complicated prior knowledge or constraints such as those in [3, 4, 18], which strongly confirms the importance of data representation.
Pedestrian and Cyclist. Most previous image-based 3D detection methods focus only on the Car category, as KITTI provides enough instances to train their models. Our model also achieves promising detection performance on the Pedestrian and Cyclist categories, because data augmentation is much easier and more effective for point clouds. Table 4 shows the localization and 3D detection AP for these categories on the KITTI validation set.
4.2 Detailed analysis of proposed method
In this section we provide analysis and ablation experiments to validate our design choices.
RGB information. We further evaluate the effect of the proposed RGB fusion module. The baselines are the proposed method without RGB values and the proposed method using them as additional channels of the generated points. Table 5 shows the relevant results for the Car category on KITTI. The proposed module obtains around 2.1 and 1.6 points of mAP improvement (moderate) on the localization and detection tasks, and qualitative comparisons can be found in Fig. 6. Note also that incorrect use of RGB information, such as plain concat, leads to performance degradation.
Points segmentation. We compare the proposed point segmentation method with the 3D segmentation PointNet used in F-PointNet. The baseline estimates 3D boxes directly from the noisy point clouds, which can be regarded as classifying all points as positive samples. As shown in Table 6, our prior-based method clearly outperforms both the baseline and the segmentation PointNet, which demonstrates the effectiveness of the proposed method, and Table 7 shows that the proposed method is robust to varying thresholds. Meanwhile, the experimental results also show that the learning-based method is not applicable to the segmentation of approximate point clouds, because it is difficult to obtain reliable labels. In addition, the proposed method is much faster than the segmentation PointNet (around 5 ms on CPU vs. 20 ms on GPU).
|Method|IoU|Easy|Moderate|Hard|
|seg-net used in F-PointNet|0.5|67.01|45.51|40.65|
|seg-net used in F-PointNet|0.7|29.49|18.70|16.57|
Depth maps. As described in Sec. 3, our approach depends on the point clouds generated from the output of the depth generator. To study the impact of depth map quality on the overall performance of the proposed method, we implemented four different depth generators [9, 14, 20, 2]. From the results shown in Table 8, we find that 3D detection accuracy increases significantly when more accurate depth is used. It is worth noting that even with the unsupervised monocular depth generator, the proposed method still outperforms the previous state of the art by a large margin.
Sampling quantity. Some studies such as [23, 24] observe that classification and segmentation accuracy decreases dramatically as the number of points decreases; we show that our approach is not so sensitive to the number of points. In our approach, we randomly select a fixed number of points (512 in the default configuration) for the 3D detection task. Table 9 shows the performance of the proposed method under different sampling quantities. According to the results, the AP initially increases with the number of points. Then, after reaching a certain level (512 points), the performance tends to be stable. It is worth noting that we still obtain relatively good detection performance even with few sampled points.
Robustness. We show that the proposed method is robust to various kinds of input corruption. We first set the sampling quantity to 512 in the training phase but use different values in the testing phase. Fig. 4 shows that the proposed method retains more than 70% AP even when 80% of the points are missing. Then, we test the robustness of the model to point perturbations; the results are also shown in Fig. 4.
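The two kinds of corruption used in such a robustness test can be sketched as follows. The text does not specify how points are dropped or perturbed, so uniform random dropping and zero-mean Gaussian jitter are assumptions, as are the function names and the `sigma` parameter.

```python
import random


def drop_points(points, drop_ratio, rng):
    """Randomly discard a fraction of the points, keeping at least one."""
    keep = max(1, round(len(points) * (1.0 - drop_ratio)))
    return rng.sample(points, keep)


def jitter(points, sigma, rng):
    """Perturb every coordinate with zero-mean Gaussian noise (assumed noise model)."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


rng = random.Random(0)
cloud = [(float(i), 0.0, 10.0) for i in range(512)]
sparse = drop_points(cloud, 0.8, rng)     # keep roughly 20% of the points
noisy = jitter(cloud, sigma=0.05, rng=rng)
unchanged = jitter(cloud, sigma=0.0, rng=rng)
```

Evaluating a model trained on 512 clean points against `sparse` or `noisy` inputs of this kind gives curves comparable in spirit to those reported in Fig. 4.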
Network architecture. We also investigate the impact of different 3D detection network architectures on overall performance; the experimental results are shown in Table 10.
Extensions. To further evaluate the proposed method, we extend it to stereo-based and LiDAR-based versions. We select some representative methods and report the comparative results in Table 11. Note that the proposed method with LiDAR data outperforms F-PointNet by 1.8%, which shows that our RGB fusion module is equally effective for LiDAR-based methods.
4.3 Qualitative Results and Failure Mode
We show some detection results of our approach in Fig. 5 and a typical localization result in Fig. 7. In general, our algorithm produces good results. However, because it is a 2D-driven framework, the proposed method fails if the 2D box is a false positive or is missing. In addition, for distant objects it is difficult for our algorithm to give accurate results because the depth is not reliable (the leftmost car in Fig. 7 is 70.35 meters away from the camera).
5 Conclusions
In this paper, we proposed a framework for accurate 3D object detection with monocular images. Unlike other image-based methods, our method solves this problem in the reconstructed 3D space in order to exploit 3D contexts explicitly. We argue that the point cloud representation is more suitable for 3D-related tasks than images. In addition, we propose a multi-modal feature fusion module that embeds the complementary RGB cue into the generated point cloud representation to enhance its discriminative capability. Our approach significantly outperforms existing monocular methods for the 3D localization and detection tasks on the KITTI benchmark. Furthermore, the extended versions verify that the design strategy can also be applied to stereo-based and LiDAR-based methods.
References
-  F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2040–2049, 2017.
-  J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
-  X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2156, 2016.
-  X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pages 424–432, 2015.
-  X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In IEEE CVPR, volume 1, page 3, 2017.
-  A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.
-  B. Li. 3d fully convolutional network for vehicle detection in point cloud. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 1513–1518. IEEE, 2017.
-  M. Liang, B. Yang, S. Wang, and R. Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 641–656, 2018.
-  T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 3, 2017.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.
-  W. Luo, B. Yang, and R. Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
-  F. Manhardt, W. Kehl, and A. Gaidon. Roi-10d: Monocular lifting of 2d detection to 6d pose and metric shape. arXiv preprint arXiv:1812.02781, 2018.
-  D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
-  N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
-  A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká. 3d bounding box estimation using deep learning and geometry. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5632–5640. IEEE, 2017.
-  C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. arXiv preprint arXiv:1711.08488, 2017.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
-  C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
-  Z. Qin, J. Wang, and Y. Lu. Monogrnet: A geometric reasoning network for monocular 3d object localization. arXiv preprint arXiv:1811.10247, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Z. Ren and E. B. Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1525–1533, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In European conference on computer vision, pages 634–651. Springer, 2014.
-  B. Xu and Z. Chen. Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2345–2353, 2018.
-  D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Monocular depth estimation using multi-scale continuous crfs as sequential deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  B. Yang, W. Luo, and R. Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
-  Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. arXiv preprint arXiv:1711.06396, 2017.