Multi-View 3D Object Detection Network for Autonomous Driving
This paper aims at high-accuracy 3D object detection in autonomous driving scenario. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25 addition, for 2D detection, our approach obtains 10.3 state-of-the-art on the hard data among the LIDAR-based methods.READ FULL TEXT VIEW PDF
This paper presents Multi-view Labelling Object Detector (MLOD). The det...
Autonomous driving requires the inference of actionable information such...
This paper focuses on the construction of stronger local features and th...
3D object detection algorithms for autonomous driving reason about 3D
Accurately estimating the orientation of pedestrians is an important and...
Detecting objects in a two-dimensional setting is often insufficient in ...
3D object detection based on LiDAR-camera fusion is becoming an emerging...
Multi-View 3D Object Detection Network for Autonomous Driving
Multiview 3D Vehicle Detection using only LiDAR
3D object detection plays an important role in the visual perception system of Autonomous driving cars. Modern self-driving cars are commonly equipped with multiple sensors, such as LIDAR and cameras. Laser scanners have the advantage of accurate depth information while cameras preserve much more detailed semantic information. The fusion of LIDAR point cloud and RGB images should be able to achieve higher performance and safty to self-driving cars.
The focus of this paper is on 3D object detection utilizing both LIDAR and image data. We aim at highly accurate 3D localization and recognition of objects in the road scene. Recent LIDAR-based methods place 3D windows in 3D voxel grids to score the point cloud [26, 7] or apply convolutional networks to the front view point map in a dense box prediction scheme . Image-based methods [4, 3] typically first generate 3D box proposals and then perform region-based recognition using the Fast R-CNN  pipeline. Methods based on LIDAR point cloud usually achieve more accurate 3D locations while image-based methods have higher accuracy in terms of 2D box evaluation. [11, 8] combine LIDAR and images for 2D detection by employing early or late fusion schemes. However, for the task of 3D object detection, which is more challenging, a well-designed model is required to make use of the strength of multiple modalities.
In this paper, we propose a Multi-View 3D object detection network (MV3D) which takes multimodal data as input and predicts the full 3D extent of objects in 3D space. The main idea for utilizing multimodal information is to perform region-based feature fusion. We first propose a multi-view encoding scheme to obtain a compact and effective representation for sparse 3D point cloud. As illustrated in Fig. 1, the multi-view 3D detection network consists of two parts: a 3D Proposal Network and a Region-based Fusion Network. The 3D proposal network utilizes a bird’s eye view representation of point cloud to generate highly accurate 3D candidate boxes. The benefit of 3D object proposals is that it can be projected to any views in 3D space. The multi-view fusion network extracts region-wise features by projecting 3D proposals to the feature maps from mulitple views. We design a deep fusion approach to enable interactions of intermediate layers from different views. Combined with drop-path training  and auxiliary loss, our approach shows superior performance over the early/late fusion scheme. Given the multi-view feature representation, the network performs oriented 3D box regression which predict accurate 3D location, size and orientation of objects in 3D space.
We evaluate our approach for the tasks of 3D proposal generation, 3D localization, 3D detection and 2D detection on the challenging KITTI  object detection benchmark. Experiments show that our 3D proposals significantly outperforms recent 3D proposal methods 3DOP  and Mono3D . In particular, with only 300 proposals, we obtain 99.1% and 91% 3D recall at Intersection-over-Union (IoU) threshold of 0.25 and 0.5, respectively. The LIDAR-based variant of our approach achieves around 25% higher accuracy in 3D localization task and 30% higher 3D Average Precision (AP) in the task of 3D object detection. It also outperforms all other LIDAR-based methods by 10.3% AP for 2D detection on KITTI’s hard test set. When combined with images, further improvements are achieved over the LIDAR-based results.
We briefly review existing work on 3D object detection from point cloud and images, multimodal fusion methods and 3D object proposals.
Most existing methods encode 3D point cloud with voxel grid representation. Sliding Shapes  and Vote3D  apply SVM classifers on 3D grids encoded with geometry features. Some recently proposed methods [23, 7, 16] improve feature representation with 3D convolutions.networks, which, however require expensive computations. In addition to the 3D voxel representation, VeloFCN  projects point cloud to the front view, obtaining a 2D point map. They apply a fully convolutional network on the 2D point map and predict 3D boxes densely from the convolutional feature maps. [24, 18, 12] investigate volumetric and multi-view representation of point cloud for 3D object classification. In this work, we encode 3D point cloud with multi-view feature maps, enabling region-based representation for multimodal fusion.
introduces 3D voxel patterns and employ a set of ACF detectors to do 2D detection and 3D pose estimation. 3DOP reconstructs depth from stereo images and uses an energy minimization approach to generate 3D box proposals, which are fed to an R-CNN  pipeline for object recognition. While Mono3D  shares the same pipeline with 3DOP, it generates 3D proposals from monocular images. [31, 32] introduces a detailed geometry representation of objects using 3D wireframe models. To incorporate temporal information, some work[6, 21] combine structure from motion and ground estimation to lift 2D detection boxes to 3D bounding boxes. Image-based methods usually rely on accurate depth estimation or landmark detection. Our work shows how to incorporate LIDAR point cloud to improve 3D localization.
Only a few work exist that exploit multiple modalities of data in the context of autonomous driving.  combines images, depth and optical flow using a mixture-of-experts framework for 2D pedestrian detection. 
fuses RGB and depth images in the early stage and trains pose-based classifiers for 2D detection. In this paper, we design a deep fusion approach inspired by FractalNet and Deeply-Fused Net . In FractalNet, a base module is iteratively repeated to construct a network with exponentially increasing paths. Similarly,  constructs deeply-fused networks by combining shallow and deep subnetworks. Our network differs from them by using the same base network for each column and adding auxiliary paths and losses for regularization.
Similarly to 2D object proposals [25, 33, 2], 3D object proposal methods generate a small set of 3D candidate boxes in order to cover most of the objects in 3D space. To this end, 3DOP  designs some depth features in stereo point cloud to score a large set of 3D candidate boxes. Mono3D  exploits the ground plane prior and utilizes some segmentation features to generate 3D proposals from a single image. Both 3DOP and Mono3D use hand-crated features. Deep Sliding Shapes 
exploits more powerful deep learning features. However, it operates on 3D voxel grids and uses computationally expensive 3D convolutions. We propose a more efficient approach by introducing a bird’s eye view representation of point cloud and employing 2D convolutions to generate accurate 3D proposals.
The MV3D network takes a multi-view representation of 3D point cloud and an image as input. It first generates 3D object proposals from the bird’s eye view map and deeply fuses multi-view features via region-based representation. The fused features are used for category classification and oriented 3D box regression.
. While the 3D grid representation preserves most of the raw information of the point cloud, it usually requires much more complex computation for subsequent feature extraction. We propose a more compact representation by projecting 3D point cloud to the bird’s eye view and the front view. Fig.2 visualizes the point cloud representation.
The bird’s eye view representation is encoded by height, intensity and density. We discretize the projected point cloud into a 2D grid with resolution of 0.1m. For each cell, the height feature is computed as the maximum height of the points in the cell. To encode more detailed height information, the point cloud is devided equally into slices. A height map is computed for each slice, thus we obtain height maps. The intensity feature is the reflectance value of the point which has the maximum height in each cell. The point cloud density indicates the number of points in each cell. To normalize the feature, it is computed as , where is the number of points in the cell. Note that the intensity and density features are computed for the whole point cloud while the height feature is computed for slices, thus in total the bird’s eye view map is encoded as -channel features.
Front view representation provides complementary information to the bird’s eye view representation. As LIDAR point cloud is very sparse, projecting it into the image plane results in a sparse 2D point map. Instead, we project it to a cylinder plane to generate a dense front view map as in . Given a 3D point , its coordinates in the front view map can be computed using
where and are the horizontal and vertical resolution of laser beams, respectively. We encode the front view map with three-channel features, which are height, distance and intensity, as visualized in Fig. 2.
Inspired by Region Proposal Network (RPN) which has become the key component of the state-of-the-art 2D object detectors 
, we first design a network to generate 3D object proposals. We use the bird’s eye view map as input. In 3D object detection, The bird’s eye view map has several advantages over the front view/image plane. First, objects preserve physical sizes when projected to the bird’s eye view, thus having small size variance, which is not the case in the front view/image plane. Second, objects in the bird’s eye view occupy different space, thus avoiding the occlusion problem. Third, in the road scene, since objects typically lie on the ground plane and have small variance in vertical location, the bird’s eye view location is more crucial to obtaining accurate 3D bounding boxes. Therefore, using explicit bird’s eye view map as input makes the 3D location prediction more feasible.
Given a bird’s eye view map. the network generates 3D box proposals from a set of 3D prior boxes. Each 3D box is parameterized by , which are the center and size (in meters) of the 3D box in LIDAR coordinate system. For each 3D prior box, the corresponding bird’s eye view anchor can be obtained by discretizing . We design 3D prior boxes by clustering ground truth object sizes in the training set. In the case of car detection, of prior boxes takes values in , and the height is set to 1.56m. By rotating the bird’s eye view anchors 90 degrees, we obtain prior boxes. is the varying positions in the bird’s eye view feature map, and can be computed based on the camera height and object height. We do not do orientation regression in proposal generation, whereas we left it to the next prediction stage. The orientations of 3D boxes are restricted to , which are close to the actual orientations of most road scene objects. This simplification makes training of proposal regression easier.
With a disretization resolution of 0.1m, object boxes in the bird’s eye view only occupy 540 pixels. Detecting such extra-small objects is still a difficult problem for deep networks. One possible solution is to use higher resolution of the input, which, however, will require much more computation. We opt for feature map upsampling as in . We use 2x bilinear upsampling after the last convolution layer in the proposal network. In our implementation, the front-end convolutions only proceed three pooling operations, i.e., 8x downsampling. Therefore, combined with the 2x deconvolution, the feature map fed to the proposal network is 4x downsampled with respect to the bird’s eye view input.
We do 3D box regression by regressing to , similarly to RPN . are the center offsets normalized by anchor sizes, and are computed as . we use a multi-task loss to simultaneously classify object/background and do 3D box regression. In particular, we use class-entropy for the “objectness” loss and Smooth  for the 3D box regression loss. Background anchors are ignored when computing the box regression loss. During training, we compute the IoU overlap between anchors and ground truth bird’s eye view boxes. An anchor is considered to be positive if its overlap is above 0.7, and negative if the overlap is below 0.5. Anchors with overlap in between are ignored.
Since LIDAR point cloud is sparse, which results in many empty anchors, we remove all the empty anchors during both training and testing to reduce computation. This can be achieved by computing an integral image over the point occupancy map.
For each non-empty anchor at each position of the last convolution feature map, the network generates a 3D box. To reduce redundancy, we apply Non-Maximum Suppression (NMS) on the bird’s eye view boxes. Different from , we did not use 3D NMS because objects should occupy different space on the ground plane. We use IoU threshold of 0.7 for NMS. The top 2000 boxes are kept during training, while in testing, we only use 300 boxes.
We design a region-based fusion network to effectively combine features from multiple views and jointly classify object proposals and do oriented 3D box regression.
Since features from different views/modalities usually have different resolutions, we employ ROI pooling 
for each view to obtain feature vectors of the same length. Given the generated 3D proposals, we can project them to any views in the 3D space. In our case, we project them to three views, i.e., bird’s eye view (BV), front view (FV), and the image plane (RGB). Given a 3D proposal, we obtain ROIs on each view via:
where denotes the tranformation functions from the LIDAR coordinate system to the bird’s eye view, front view, and the image plane, respectively. Given an input feature map from the front-end network of each view, we obtain fixed-length features via ROI pooling:
To combine information from different features, prior work usually use early fusion  or late fusion [23, 13]. Inspired by [15, 27], we employ a deep fusion approach, which fuses multi-view features hierarchically. A comparison of the architectures of our deep fusion network and early/late fusion networks are shown in Fig. 3. For a network that has layers, early fusion combines features from multiple views in the input stage:
are feature transformation functions and is a join operation (e.g., concatenation, summation). In contrast, late fusion uses seperate subnetworks to learn feature transformation independently and combines their outputs in the prediction stage:
To enable more interactions among features of the intermediate layers from different views, we design the following deep fusion process:
We use element-wise mean for the join operation for deep fusion since it is more flexible when combined with drop-path training .
Given the fusion features of the multi-view network, we regress to oriented 3D boxes from 3D proposals. In particular, the regression targets are the 8 corners of 3D boxes: . They are encoded as the corner offsets normalized by the diagonal length of the proposal box. Despite such a 24-D vector representation is redundant in representing an oriented 3D box, we found that this encoding approach works better than the centers and sizes encoding approach. Note that our 3D box regression differs from  which regresses to axis-aligned 3D boxes. In our model, the object orientations can be computed from the predicted 3D box corners. We use a multi-task loss to jointly predict object categories and oriented 3D boxes. As in the proposal network, the category loss uses cross-entropy and the 3D box loss uses smooth . During training, the positive/negative ROIs are determined based on the IoU overlap of brid’s eye view boxes. A 3D proposal is considered to be positive if the bird’s eye view IoU overlap is above 0.5, and negative otherwise. During inference, we apply NMS on the 3D boxes after 3D bounding box regression. We project the 3D boxes to the bird’s eye view to compute their IoU overlap. We use IoU threshold of 0.05 to remove redundant boxes, which ensures objects can not occupy the same space in bird’s eye view.
We employ two approaches to regularize the region-based fusion network: drop-path training  and auxiliary losses
. For each iteration, we randomly choose to do global drop-path or local drop-path with a probability of 50%. If global drop-path is chosen, we select a single view from the three views with equal probability. If local drop-path is chosen, paths input to each join node are randomly dropped with 50% probability. We ensure that for each join node at least one input path is kept. To further strengthen the representation capability of each view, we add auxiliary paths and losses to the network. As shown in Fig.4, the auxiliary paths have the same number of layers with the main network. Each layer in the auxiliary paths shares weights with the corresponding layer in the main network. We use the same multi-task loss, i.e. classification loss plus 3D box regression loss, to back-propagate each auxiliary path. We weight all the losses including auxiliary losses equally. The auxiliary paths are removed during inference.
In our multi-view network, each view has the same architecture. The base network is built on the 16-layer VGG net  with the following modifications:
[topsep=0pt, itemsep=0pt, parsep=0pt]
Channels are reduced to half of the original network.
To handle extra-small objects, we use feature approximation to obtain high-resolution feature map. In particular, we insert a 2x bilinear upsampling layer before feeding the last convolution feature map to the 3D Proposal Network. Similarly, we insert a 4x/4x/2x upsampling layer before the ROI pooling layer for the BV/FV/RGB branch.
We remove the 4th pooling operation in the original VGG network, thus the convolution parts of our network proceed 8x downsampling.
In the muti-view fusion network, we add an extra fully connected layer in addition to the original and layer.
We initialize the parameters by sampling weights from the VGG-16 network pretrained on ImageNet. Despite our network has three branches, the number of parameters is about 75% of the VGG-16 network. The inference time of the network for one image is around 0.36s on a GeForce Titan X GPU.
In the case of KITTI, which provides only annotations for objects in the front view (around field of view), we use point cloud in the range of [0, 70.4] [-40, 40] meters. We also remove points that are out of the image boundaries when projected to the image plane. For bird’s eye view, the discretization resolution is set to 0.1m, therefore the bird’s eye view input has size of 704800. Since KITTI uses a 64-beam Velodyne laser scanner, we can obtain a 64512 map for the front view points. The RGB image is up-scaled so that the shortest size is 500.
The network is trained in an end-to-end fashion. For each mini-batch we use 1 image and sample 128 ROIs, roughly keeping 25% of the ROIs as positive. We train the network using SGD with a learning rate of 0.001 for 100K iterations. Then we reduce the learning rate to 0.0001 and train another 20K iterations.
|Data||AP (IoU=0.5)||AP (IoU=0.5)||AP (IoU=0.7)|
|Deep Fusion w/o aux. loss||94.21||88.29||87.21||94.57||88.75||88.02||88.64||85.74||79.06|
|Deep Fusion w/ aux. loss||96.02||89.05||88.38||96.34||89.39||88.67||95.01||87.59||79.90|
|Data||AP (IoU=0.5)||AP (IoU=0.5)||AP (IoU=0.7)|
We evaluate our MV3D network on the challenging KITTI object detection benchmark . The dataset provides 7,481 images for training and 7,518 images for testing. As the test server only evaluates 2D detection, we follow  to split the training data into training set and validation set, each containing roughly half of the whole training data. We conduct 3D box evaluation on the validation set. We focus our experiments on the car category as KITTI provides enough car instances for our deep network based approach. Following the KITTI setting, we do evaluation on three difficulty regimes: easy, moderate and hard.
We evaluate 3D object proposals using 3D box recall as the metric. Different from 2D box recall , we compute the IoU overlap of two cuboids. Note that the cuboids are not necessary to align with the axes, i.e., they could be oriented 3D boxes. In our evaluation, we set the 3D IoU threshold to 0.25 and 0.5, respectively. For the final 3D detection results, we use two metrics to measure the accuracy of 3D localization and 3D bounding box detection. For 3D localization, we project the 3D boxes to the ground plane (i.e., bird’s eye view) to obtain oriented bird’s eye view boxes. We compute Average Precision (AP) for the bird’s eye view boxes. For 3D bounding box detection, we also use the Average Precision (AP) metric to evaluate the full 3D bounding boxes. Note that both the bird’s eye view boxes and the 3D boxes are oriented, thus object orientations are implicitly considered in these two metrics. We also evaluate the performance of 2D detection by projecting the 3D boxes to the image plane. Average Preicision (AP) is also used as the metric. Following the KITTI convention, IoU threshold is set to 0.7 for 2D boxes.
As this work aims at 3D object detection, we mainly compare our approach to LIDAR-based methods VeloFCN , 3D FCN , Vote3Deep  and Vote3D , as well as image-based methods 3DOP  and Mono3D . For fair comparison, we focus on two variants of our approach, i.e., the purely LIDAR-based variant which uses bird’s eye view and front view as input (BV+FV), and the multimodal variant which combines LIDAR and RGB data (BV+FV+RGB). For 3D box evaluation, we compare with VeloFCN, 3DOP and Mono3D since they provide results on the validation set. For 3D FCN, Vote3Deep and Vote3D, which have no results publicly available, we only do comparison on 2D detection on the test set.
|Faster R-CNN ||Mono||87.90||79.11||70.19|
|SDP+RPN [30, 19]||Mono||89.90||89.42||78.54|
|3D FCN ||LIDAR||85.54||75.83||68.30|
3D box recall are shown in Fig. 5. We plot recall as a function of IoU threshold using 300 proposals. Our approach significantly outperforms 3DOP  and Mono3D  across all the IoU thresholds. Fig. 5 also shows 3D recall as a function of the proposal numbers under IoU threshold of 0.25 and 0.5, respectively. Using only 300 proposals, our approach obtains 99.1% recall at IoU threshold of 0.25 and 91% recall at IoU of 0.5. In contrast, when using IoU of 0.5, the maximum recall that 3DOP can achieve is only 73.9%. The large margin suggests the advantage of our LIDAR-based approach over image-based methods.
We use IoU threshold of 0.5 and 0.7 for 3D localization evaluation. Table 1 shows AP on KITTI validation set. As expected, all LIDAR-based approaches performs better than stereo-based method 3DOP  and monocular method Mono3D . Among LIDAR-based approaches, our method (BV+FV) outperforms VeloFCN  by 25% AP under IoU threshold of 0.5. When using IoU=0.7 as the criteria, our improvement is even larger, achieving 45% higher AP across easy, moderate and hard regimes. By combining with RGB images, our approach is further improved. We visualize the localization results of some examples in Fig. 6.
For the 3D overlap criteria, we focus on 3D IoU of 0.5 and 0.7 for LIDAR-based methods. As these IoU thresholds are rather strict for image-based methods, we also use IoU of 0.25 for evaluation. As shown in Table 2, our “BV+FV” method obtains 30% higher AP over VeloFCN when using IoU of 0.5, achieving 87.65% AP in the moderate setting. With criteria of IoU=0.7, our multimodal approach still achieves 71.29% AP on easy data. In the moderate setting, the best AP that can be achieved by 3DOP using IoU=0.25 is 68.82%, while our approach achieves 89.05% AP using IoU=0.5. Some 3D detectioin results are visualized in Fig. 6.
We first compare our deep fusion network with early/late fusion approaches. As commonly used in literature, the join operation is instantiated with concatenation in the early/late fusion schemes. As shown in Table 3, early and late fusion approaches have very similar performance. Without using auxiliary loss, the deep fusion method achieves 0.5% improvement over early and late fusion approaches. Adding auxiliary loss further improves deep fusion network by around 1%.
To study the contributions of the features from different views, we experiment with different combination of the bird’s eye view (BV), the front view (FV), and the RGB image (RGB). The 3D proposal network is the same for all the variants. Detailed comparisons are shown in Table 4. If using only a single view as input, the bird’s eye view feature performs the best while the front view feature the worst. Combining any of the two views can always improve over individual views. This justifies our assumption that features from different views are complementary. The best overal performance can be achieved when fusing features of all three views.
We finally evaluate 2D detection performance on KITTI test set. Results are shown in Table 5. Among the LIDAR-based methods, our “BV+FV” approach outperforms the recently proposed 3D FCN  method by 10.31% AP in the hard setting. In overall, image-based methods usually perform better than LIDAR-based methods in terms of 2D detection. This is due to the fact that image-based methods directly optimize 2D boxes while LIDAR-based methods optimize 3D boxes. Note that despite our method optimizes 3D boxes, it also obtains competitive results compared with the state-of-the-art 2D detection methods.
We have proposed a multi-view sensory-fusion model for 3D object detection in the road scene. Our model takes advantage of both LIDAR point cloud and images. We align different modalities by generating 3D proposals and projecting them to multiple views for feature extraction. A region-based fusion network is presented to deeply fuse multi-view information and do oriented 3D box regression. Our approach significantly outperforms existing LIDAR-based and image-based methods on tasks of 3D localization and 3D detection on KITTI benchmark . Our 2D box results obtained from 3D detections also show competitive performance compared with the state-of-the-art 2D detection methods.
The work was supported by National Key Basic Research Program of China (No. 2016YFB0100900) and NSFC 61171113.
A unified multi-scale deep convolutional neural network for fast object detection.In ECCV, 2016.
A continuous occlusion model for road scene understanding.In CVPR, pages 4331–4339, 2016.
On-board object detection: Multicue, multimodal, and multiview random forest of local experts.In IEEE Transactions on Cybernetics, 2016.