1 Introduction
3D object detection is a fundamental problem with many applications such as autonomous driving and robotics. Previous methods show promising results by utilizing LiDAR devices, which produce precise depth information in the form of 3D point clouds. However, due to the high cost and sparse output of LiDAR, it is desirable to seek cheaper alternatives such as monocular cameras. This problem remains largely unsolved, though it has drawn much attention.
Recent methods towards the above goal can be generally categorized into two streams: image-based approaches [36, 26, 41, 19, 17, 4] and pseudo-LiDAR point-based approaches [48, 33, 50]. The image-based approaches [5, 17] typically leverage geometry constraints including object shape, the ground plane, and key points. These constraints are formulated as different terms in the loss function to improve detection results. The pseudo-LiDAR point-based approaches transform depth maps estimated from 2D images into point cloud representations to mimic the LiDAR signal. As shown in Figure
1, both of these methods have drawbacks, resulting in suboptimal performance. Specifically, the image-based methods typically fail to capture meaningful local object scale and structure information because of the following two factors. (1) Due to perspective projection, objects at far and near distances exhibit significant changes in scale in the monocular view. It is difficult for traditional 2D convolutional kernels to process objects of different scales simultaneously (see Figure 2). (2) The local neighborhood of a 2D convolution is defined in the camera plane, where the depth dimension is lost. In this non-metric space (i.e. the distance between pixels does not have a clear physical meaning like depth), a filter cannot distinguish objects from the background. In that case, a car area and the background area would be treated equally.
Although pseudo-LiDAR point-based approaches have achieved promising results, they still possess two key issues. (1) The performance of these approaches heavily relies on the precision of the estimated depth maps (see Figure 1). The depth maps extracted from monocular images are often coarse (point clouds estimated from them have wrong coordinates), leading to inaccurate 3D predictions. In other words, the accuracy of the depth map limits the performance of 3D object detection. (2) Pseudo-LiDAR methods cannot effectively employ the high-level semantic information extracted from RGB images, leading to many false alarms. This is because point clouds provide spatial information but lose semantic information. As a result, regions like roadblocks, electrical boxes and even dust on the road may cause false detections, even though they can be easily discriminated by using RGB images.
To address the above problems, we propose a novel convolutional network, termed Depth-guided Dynamic-Depthwise-Dilated local convolutional network (DLCN), where the convolutional kernels are generated from the depth map and applied locally to each pixel and channel of an individual image sample, rather than learning global kernels that apply to all images. As shown in Figure 2, DLCN treats the depth map as guidance to learn local dynamic-depthwise-dilated kernels from RGB images, so as to fill the gap between 2D and 3D representations. More specifically, the learned kernel in DLCN is sample-wise (i.e. an exemplar kernel [15]), position-wise (i.e. local convolution [20]), and depth-wise (i.e. depthwise convolution [18]), where each kernel has its own dilation rate (i.e. different exemplar kernels have different receptive fields).
DLCN is carefully designed with four considerations. (1) The exemplar kernel learns a specific scene geometry for each image. (2) The local convolution distinguishes object and background regions for each pixel. (3) The depthwise convolution learns different channel filters in a convolutional layer with different purposes and reduces computational complexity. (4) The exemplar dilation rate learns different receptive fields for different filters to account for objects with diverse scales. These delicate designs can be easily and efficiently implemented by combining the linear operators of shift and element-wise product. As a result, the efficient DLCN can not only address the problem of the scale-sensitive and meaningless local structure of 2D convolutions, but also benefit from the high-level semantic information of RGB images compared with the pseudo-LiDAR representation.
Our main contributions are threefold. (1) A novel component for 3D object detection, DLCN, is proposed, where the depth map guides the learning of dynamic-depthwise-dilated local convolutions from a single monocular image. (2) We carefully design a single-stage 3D object detection framework based on DLCN to learn a better 3D representation, reducing the gap between 2D convolutions and point cloud-based 3D operators. (3) Extensive experiments show that DLCN outperforms state-of-the-art monocular 3D detection methods and takes first place on the KITTI benchmark [12]. The code will be released.
2 Related Work
Image-based Monocular 3D Detection. Previous monocular 3D detection methods [36, 26, 41, 1, 19, 17, 4, 54] usually make assumptions about the scene geometry and use them as constraints to train the 2D-to-3D mapping. Deep3DBox [36] uses the camera matrix to project a predicted 3D box onto the 2D image plane, constraining each side of the 2D detection box such that it corresponds to one of the eight corners of the 3D box. OFTNet [43] introduces the orthographic feature transform, which maps image-based features into an orthographic 3D space; it is helpful when the scale of objects varies drastically. [21, 31] investigated different ways of learning the confidence to model heteroscedastic uncertainty by using a 3D intersection-over-union (IoU) loss. To introduce more prior information, [2, 24, 57, 53] used 3D shapes as templates to obtain better object geometry. [23] predicts a point cloud in an object-centered coordinate system and devises a projection alignment loss to learn local scale and shape information. [34] proposes a 3D synthetic data augmentation algorithm via inpainting recovered meshes directly onto the 2D scenes. However, as it is not easy for 2D image features to represent 3D structures, the above geometric constraints fail to restore accurate 3D information of objects from just a single monocular image. Therefore, our motivation is to utilize depth information, which essentially bridges the gap between 2D and 3D representations, to guide the learning of the 2D-to-3D feature representation.
Point Cloud-based Monocular 3D Detection. Previous monocular methods [48, 33, 50] convert image-based depth maps to pseudo-LiDAR representations to mimic the LiDAR signal. With this representation, existing LiDAR-based detection algorithms can be directly applied to monocular 3D object detection. For example, [50] detects 2D object proposals in the input image and extracts a point cloud frustum from the pseudo-LiDAR for each proposal. [33] proposes a multi-modal feature fusion module to embed the complementary RGB cue into the generated point cloud representation. However, this depth-to-LiDAR transformation relies heavily on the accuracy of the depth map and cannot make use of RGB information. In contrast, our method treats the depth map as guidance to learn a better 3D representation from RGB images.
LiDAR-based 3D Detection. With the development of deep learning on point sets, 3D feature learning [39, 40, 59] is able to learn deep point-based and voxel-based features. Benefiting from this, LiDAR-based methods have achieved promising results in 3D detection. For example, [59] divides point clouds into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation. [47] applies the FPN technique to voxel-based detectors. [55] investigates a sparse convolution for voxel-based networks. [25] utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). [38] leverages mature 2D object detectors to learn directly from 3D point clouds. [49] aggregates point-wise features as frustum-level feature vectors.
[44, 8] directly generated a small number of high-quality 3D proposals from point clouds by segmenting the point clouds of the whole scene into foreground and background. There are also some works that focus on multi-sensor fusion (LiDAR as well as cameras) for 3D object detection. [29, 28] proposed a continuous fusion layer that encodes both discrete-state image features and continuous geometric information. [7, 22] used LiDAR point clouds and RGB images to generate features and encoded the sparse 3D point cloud with a compact multi-view representation.
Dynamic Networks. A number of existing techniques can be deployed to exploit the depth information for monocular 3D detection. M3DRPN [1] proposes a depth-aware convolution that uses non-shared kernels in the row space to learn spatially-aware features. However, this rough and fixed spatial division is biased and fails to capture object scale and local structure. The dynamic filtering network [20] uses sample-specific and position-specific filters but has a heavy computational cost, and it also fails to solve the scale-sensitive problem of 2D convolutions. The Trident network [27] utilizes manually defined multi-head detectors for 2D detection; however, it needs to manually group data for the different heads. Other techniques like deformable convolution [9] and variants of [20] such as [14, 46, 52] fail to capture object scale and local structure as well. In this work, our depth-guided dynamic dilated local convolutional network is proposed to solve the two problems associated with 2D convolutions and narrow the gap between 2D convolutions and point cloud-based 3D processing.
3 Methodology
As a single-stage 3D detector, our framework consists of three key components: a network backbone, a depth-guided filtering module, and a 2D-3D detection head (see Figure 3). Details of each component are given below. First, we give an overview of our architecture as well as the backbone networks. We then detail our depth-guided filtering module, which is the key component bridging 2D convolutions and point cloud-based 3D processing. Finally, we outline the details of our 2D-3D detection head.
3.1 Backbone
To utilize depth maps as guidance for 2D convolutions, we formulate our backbone as a two-branch network: the first branch is the feature extraction network, which takes RGB images as input, and the other is the filter generation network, which generates convolutional kernels for the feature extraction network using the estimated depth as input. The two networks process their inputs separately, and the outputs of each block are merged by the depth-guided filtering module.
The backbone of the feature extraction network is ResNet-50 [16] without its final FC and pooling layers, pre-trained on the ImageNet classification dataset [10]. To obtain a larger field-of-view and keep the network stride at 16, we find the last convolutional layer (conv5_1, block4) that decreases resolution and set its stride to 1 to avoid signal decimation, and replace all subsequent convolutional layers with dilated convolutional layers (with a dilation rate of 2). For the filter generation network, we only use the first three blocks of ResNet-50 to reduce computational costs. Note that the two branches have the same number of channels in each block, as required by the depth-guided filtering module.
3.2 Depth-Guided Filtering Module
Traditional 2D convolution kernels fail to efficiently model the depth-dependent scale variance of objects and to effectively reason about the spatial relationship between foreground and background pixels. On the other hand, pseudo-LiDAR representations rely too much on the accuracy of the depth map and lose the RGB information. To address these problems simultaneously, we propose our depth-guided filtering module. Notably, with our module, the convolutional kernels and their receptive fields (dilations) are different for different pixels and channels of different images.
Since the kernel of our feature extraction network is trained and generated from the depth map, it is sample-specific and position-specific, as in [20, 14], and can thus capture meaningful local structures in the way that point-based operators do on point clouds. We first introduce the idea of depthwise convolution [18] into the network, termed depthwise local convolution (DLCN). Generally, depthwise convolution (DCN) involves a set of global filters, where each filter only operates on its corresponding channel, while DLCN requires a feature volume of local filters of the same size as the input feature maps. As the generated filters are actually a feature volume, a naive way to perform DLCN is to convert the feature volume into $H_i \times W_i$ location-specific filters and then apply depthwise and local convolutions to the feature maps, where $H_i$ and $W_i$ are the height and width of the feature maps at block $i$. This implementation would be time-consuming as it ignores the redundant computations of neighboring pixels. To reduce the time cost, we employ the shift and element-wise product operators, in which shift [51] is a zero-flop, zero-parameter operation, and element-wise product requires little computation. Concretely, let $X_i$ and $Y_i$ be the outputs of the feature extraction network and the filter generation network, respectively, where $i$ is the index of the block. Let $k$ denote the kernel size of the feature extraction network. By defining a shifting grid $\mathcal{G}$ that contains $k^2$ elements, for every vector $g \in \mathcal{G}$ we shift the whole feature map towards the direction and step size indicated by $g$ and get the results $X_i^g$ and $Y_i^g$. For example, when $k = 3$, $\mathcal{G} = \{(-1,-1), (-1,0), \dots, (1,1)\}$ and the feature map is moved towards nine directions with a horizontal or vertical step size of 0 or 1. We then use the sum and element-wise product operations to compute our filtering result:
$$\tilde{X}_i = \frac{1}{k^2} \sum_{g \in \mathcal{G}} X_i^{g} \odot Y_i^{g}, \quad (1)$$
where $X_i^{g}$ and $Y_i^{g}$ denote the feature maps of block $i$ shifted by $g$, and $\odot$ is the element-wise product.
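The shift-and-product trick can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code: the zero-padded `shift` helper, the grid construction, and the $1/k^2$ normalization are assumptions made for the sketch.

```python
import numpy as np

def shift(x, dy, dx):
    """Spatially shift a (C, H, W) feature map so that out[:, u, v] =
    x[:, u + dy, v + dx], zero-padding at the border."""
    out = np.zeros_like(x)
    H, W = x.shape[1:]
    src_y = slice(max(dy, 0), H + min(dy, 0))
    src_x = slice(max(dx, 0), W + min(dx, 0))
    dst_y = slice(max(-dy, 0), H + min(-dy, 0))
    dst_x = slice(max(-dx, 0), W + min(-dx, 0))
    out[:, dst_y, dst_x] = x[:, src_y, src_x]
    return out

def depthwise_local_filter(X, Y, k=3):
    """Depthwise local convolution of feature maps X by the per-pixel filter
    volume Y (same shape): shift both maps over the k x k grid, multiply
    element-wise and accumulate."""
    r = k // 2
    out = np.zeros_like(X)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += shift(X, dy, dx) * shift(Y, dy, dx)
    return out / (k * k)  # the 1/k^2 normalization is an assumption
```

Each output pixel is then the inner product of the k x k neighborhoods of X and Y at that location, i.e. a depthwise local convolution whose per-pixel filters are read off the generated feature volume Y.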
To encourage information flow between the channels of the depthwise convolution, we further introduce a novel shift-pooling operator into the module. Let $n_s$ be the number of channels that share information. We shift the feature maps along the channel axis $n_s - 1$ times, by $1, \dots, n_s - 1$ channels respectively, to obtain $n_s - 1$ new shifted feature maps. We then take the element-wise mean of the shifted feature maps and the original one to obtain the new feature map used as the input of the module. The process of this shift-pooling operation is shown in Figure 4.
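A minimal sketch of the shift-pooling operator described above, assuming the channel shifts are cyclic (`np.roll`) and that the mean includes the original map; both details are assumptions made for illustration.

```python
import numpy as np

def shift_pooling(X, ns=2):
    """Shift-pooling over a (C, H, W) feature map: average X with its
    (ns - 1) channel-shifted copies so that each depthwise filter mixes
    information from ns neighboring channels."""
    maps = [np.roll(X, s, axis=0) for s in range(ns)]  # s = 0 keeps X itself
    return np.mean(maps, axis=0)
```

The operator is parameter-free and keeps the feature map shape unchanged, which is what makes it cheaper than grouping channels inside the convolution itself.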
Compared to the 'group' idea of depthwise convolution in [18, 58], which groups several channels together to perform information fusion between them, the proposed shift-pooling operator is more efficient and adds no additional parameters to the convolution. The size of the convolutional weights of each local kernel is unchanged when applying shift-pooling, while in group convolution [18] it changes significantly with the number of groups (assuming that the convolution keeps the number of channels unchanged). Note that it is difficult for the filter generation network to generate the many kernels required by traditional convolutions between all channels, and the position-specific characteristic would dramatically increase their computational cost.
With our depthwise formulation, different kernels can have different functions. This enables us to assign different dilation rates [56] to each filter to address the scale-sensitive problem. Since there are huge intra-class and inter-class scale differences in an RGB image, we use $Y_i$ to learn an adaptive dilation rate for each filter, obtaining different sizes of receptive fields via an adaptive function $f$. Specifically, let $d$ denote our maximum dilation rate; the adaptive function consists of three layers: (1) an AdaptiveMaxPool2d layer; (2) a convolutional layer; (3) a reshape and softmax layer that generates $d$ weights with a sum of 1 for each filter. Formally, our guided filtering with the adaptive dilated function is formulated as follows:
$$\tilde{X}_i = \sum_{t=1}^{d} f_t(Y_i) \odot \Big( \frac{1}{k^2} \sum_{g \in \mathcal{G}} X_i^{tg} \odot Y_i^{tg} \Big), \quad (2)$$
where $X_i^{tg}$ denotes the feature map shifted by $g$ scaled with dilation rate $t$, and $f_t(Y_i)$ is the per-filter weight assigned to dilation rate $t$.
For different images, our depth-guided filtering module assigns different kernels to different pixels and adaptive receptive fields (dilations) to different channels. This solves the problems of the scale-sensitive and meaningless local structure of 2D convolutions, and also makes full use of the RGB information, in contrast to pseudo-LiDAR representations.
3.3 2D-3D Detection Head
In this work, we adopt a single-stage detector with prior-based 2D-3D anchor boxes [42, 32] as our base detector.
3.3.1 Formulation
Inputs: The output feature map of our backbone network, with a network stride factor of 16. Following common practice, we use a calibrated setting which assumes that per-image camera intrinsics are available at both training and test time. The 3D-to-2D projection can be written as:
$$z \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \quad (3)$$
where $(x, y, z)$ denotes the horizontal position, height and depth of the 3D point in camera coordinates, $P$ is the camera projection matrix, and $(u, v)$ is the projection of the 3D point in 2D image coordinates.
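The projection can be sketched in a few lines, assuming a KITTI-style 3 x 4 projection matrix `P`; the helper name and the matrix values used below are illustrative.

```python
import numpy as np

def project_to_image(pts_3d, P):
    """Project N x 3 camera-frame points to pixel coordinates with a 3 x 4
    projection matrix P, following z * [u, v, 1]^T = P [x, y, z, 1]^T."""
    pts_h = np.hstack([pts_3d, np.ones((pts_3d.shape[0], 1))])  # N x 4 homogeneous
    uvw = pts_h @ P.T                                           # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                             # divide by depth
```

Example: with focal length 700 and principal point (600, 180), a point on the optical axis projects to the principal point regardless of its depth.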
Ground Truth: We define a ground truth (GT) box using the following parameters: the 2D bounding box $(x_{2\mathrm{D}}, y_{2\mathrm{D}}, w_{2\mathrm{D}}, h_{2\mathrm{D}})$, where $(x_{2\mathrm{D}}, y_{2\mathrm{D}})$ is the center and $(w_{2\mathrm{D}}, h_{2\mathrm{D}})$ are the width and height of the 2D box; the 3D center $(x_{3\mathrm{D}}, y_{3\mathrm{D}}, z_{3\mathrm{D}})$, representing the location of the 3D center in camera coordinates; the 3D shape $(h_{3\mathrm{D}}, w_{3\mathrm{D}}, l_{3\mathrm{D}})$ (3D object dimensions: height, width, length, in meters); and the allocentric pose $\alpha$ in 3D space (the observation angle of the object, ranging over $[-\pi, \pi]$) [34]. Note that we use the minimum enclosing rectangle of the projected 3D box as our ground truth 2D bounding box.
Outputs: Let $n_a$ denote the number of anchors and $n_c$ the number of classes. For each position of the input, the output for an anchor contains the following parameters: the predicted 2D box; the position of the projected 3D center in the 2D plane; the predicted depth, 3D shape and rotation; the 8 projected 3D corners; and the classification score of each class. The size of the output is proportional to $n_a$ and to $H/16 \times W/16$, where $H \times W$ is the size of the input image and 16 is the down-sampling factor. The output is actually an anchor-based transformation of the 2D-3D box.
3.3.2 2D-3D Anchor
Inspired by [1], we utilize 2D-3D anchors with priors as our default anchor boxes. More specifically, a 2D-3D anchor is first defined in the 2D space as in [32], and the corresponding priors in the training dataset are then used to compute its 3D part. One template anchor is defined using parameters of both spaces: its 2D box parameters together with the corresponding 3D anchor parameters (depth, shape, rotation).
For the 2D anchors, we use 12 different scales ranging from 30 to 400 pixels in height, spaced by a power function, together with three aspect ratios, to define a total of 36 anchors. We then project all ground truth 3D boxes into the 2D space. For each projected box, we calculate its intersection over union (IoU) with each 2D anchor and assign the corresponding 3D box to the anchors whose IoU exceeds a threshold. For each 2D anchor, we then use the statistics across all matching ground truth 3D boxes as its corresponding 3D anchor. Note that the same anchor parameters are shared for the 2D and 3D regression. The anchors enable our network to learn a relative value (residual) with respect to the ground truth, which significantly reduces the difficulty of learning.
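The anchor-prior computation can be sketched as follows: match the projected GT boxes to 2D anchors by IoU and average the 3D parameters of the matches. The IoU threshold value and the choice of the mean as the matching statistic are assumptions of this sketch.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def anchor_3d_priors(anchors_2d, gt_boxes_2d, gt_params_3d, thr=0.5):
    """For each 2D anchor, average the 3D parameters (e.g. depth, shape,
    rotation) of all projected ground-truth boxes with IoU >= thr; that
    mean becomes the anchor's 3D prior (None if nothing matches)."""
    priors = []
    for a in anchors_2d:
        matched = [p for b, p in zip(gt_boxes_2d, gt_params_3d)
                   if iou_2d(a, b) >= thr]
        priors.append(np.mean(matched, axis=0) if matched else None)
    return priors
```
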
3.3.3 Data Transformation
We combine the output of our network, which is an anchor-based transformation of the 2D-3D box, with the predefined anchors to obtain our estimated 3D boxes:
(4) 
where the estimated 3D center projection in the 2D plane, the depth of the 3D center and of the eight corners, and the 3D rotation are each obtained by combining the output of the network with the anchor.
3.3.4 Losses
Our overall loss contains a classification loss, a 2D regression loss, a 3D regression loss and a 2D-3D corner loss. We use the idea of the focal loss [30] to balance the samples. Let $p_t$ and $\gamma$ denote the classification score of the target class and the focusing parameter, respectively. We have:
$$L = (1 - p_t)^{\gamma} \left( L_{\mathrm{class}} + L_{\mathrm{2D}} + L_{\mathrm{3D}} + L_{\mathrm{corner}} \right), \quad (5)$$
where $L_{\mathrm{class}}$, $L_{\mathrm{2D}}$, $L_{\mathrm{3D}}$ and $L_{\mathrm{corner}}$ are the classification loss, 2D regression loss, 3D regression loss and 2D-3D corner loss, respectively, and the focusing parameter is fixed in all experiments.
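A toy version of the overall loss: cross-entropy on the target-class score plus SmoothL1 regression terms, all scaled by the focal factor $(1 - p_t)^{\gamma}$. The exact grouping of terms under the focal factor, and summing SmoothL1 over elements, are assumptions of this sketch.

```python
import numpy as np

def smooth_l1(pred, target):
    """Element-wise SmoothL1, summed: 0.5 d^2 if |d| < 1, else |d| - 0.5."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def overall_loss(p_t, reg_2d, reg_3d, corners,
                 gt_2d, gt_3d, gt_corners, gamma=2.0):
    """Overall loss: CE on the target-class score plus SmoothL1 regression
    terms for the 2D box, 3D parameters and projected corners, scaled by
    the focal factor (1 - p_t)^gamma."""
    l_class = -np.log(p_t)
    l_reg = (smooth_l1(reg_2d, gt_2d) + smooth_l1(reg_3d, gt_3d)
             + smooth_l1(corners, gt_corners))
    return (1.0 - p_t) ** gamma * (l_class + l_reg)
```

The focal factor vanishes as $p_t \to 1$, so well-classified (mostly background) anchors contribute little, which is the sample-balancing effect the text refers to.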
In this work, we employ the standard cross-entropy (CE) loss for classification:
$$L_{\mathrm{class}} = -\log(p_t). \quad (6)$$
Moreover, for both 2D and 3D regression, we simply use SmoothL1 regression losses:
$$L_{\mathrm{2D}} = \mathrm{SmoothL1}\big(b_{\mathrm{2D}}, \hat{b}_{\mathrm{2D}}\big), \qquad L_{\mathrm{3D}} = \mathrm{SmoothL1}\big(b_{\mathrm{3D}}, \hat{b}_{\mathrm{3D}}\big), \quad (7)$$
where the 3D regression targets include the projected corners of the GT 3D box in image coordinates and its GT depth.
Method  Test set  Split1  Split2  

Easy  Moderate  Hard  Easy  Moderate  Hard  Easy  Moderate  Hard  
OFTNet [43]  1.61  1.32  1.00  4.07  3.27  3.29  –  –  – 
FQNet [31]  2.77  1.51  1.01  5.98  5.50  4.75  5.45  5.11  4.45 
ROI10D [34]  4.32  2.02  1.46  10.25  6.39  6.18  –  –  – 
GS3D [26]  4.47  2.90  2.47  13.46  10.97  10.38  11.63  10.51  10.51 
Shift RCNN [37]  6.88  3.87  2.83  13.84  11.29  11.08  –  –  – 
MonoGRNet [41]  9.61  5.74  4.25  13.88  10.19  7.62  –  –  – 
MonoPSR [23]  10.76  7.25  5.85  12.75  11.48  8.59  13.94  12.24  10.77 
Mono3DPLiDAR [50]  10.76  7.50  6.10  31.5  21.00  17.50  –  –  – 
SS3D [21]  10.78  7.68  6.51  14.52  13.15  11.85  9.45  8.42  7.34 
MonoDIS [45]  10.37  7.94  6.40  11.06  7.60  6.37  –  –  – 
PseudoLiDAR [48]  –  –  –  19.50  17.20  16.20  –  –  – 
M3DRPN [1]  14.76  9.71  7.42  20.27  17.06  15.21  20.40  16.48  13.34 
AM3D [33]  16.50  10.74  9.52 (+0.01)  32.23 (+5.26)  21.09  17.26  –  –  – 
DLCN (Ours)  16.65 (+0.15)  11.72 (+0.98)  9.51  26.97  21.71 (+0.62)  18.22 (+0.96)  24.29 (+3.89)  19.54 (+3.06)  16.38 (+3.04) 
4 Experiments
4.1 Dataset and Setting
KITTI Dataset. The KITTI 3D object detection dataset [12] is widely used for monocular and LiDARbased 3D detection. It consists of 7,481 training images and 7,518 test images as well as the corresponding point clouds and the calibration parameters, comprising a total of 80,256 2D3D labeled objects with three object classes: Car, Pedestrian, and Cyclist. Each 3D ground truth box is assigned to one out of three difficulty classes (easy, moderate, hard) according to the occlusion and truncation levels of objects. There are two trainval splits of KITTI: the split1 [5] contains 3,712 training and 3,769 validation images, while the split2 [53] uses 3,682 images for training and 3,799 images for validation. The dataset includes three tasks: 2D detection, 3D detection, and Bird’s eye view, among which 3D detection is the focus of 3D detection methods.
Precision-recall curves are used for evaluation (with an IoU threshold of 0.7). Prior to Aug. 2019, the 11-point Interpolated Average Precision (AP) metric $AP|_{R_{11}}$ proposed in the Pascal VOC benchmark was computed separately for each difficulty class and each object class. After that, the 40-recall-position metric $AP|_{R_{40}}$ is used instead of $AP|_{R_{11}}$, following [45]. All methods are ranked by the $AP|_{R_{40}}$ score of 3D car detection in the moderate setting.
Implementation Details. We apply our depth-guided filtering module three times, on the first three blocks of ResNet, which have network strides of 4, 8 and 16, respectively. [11] is used for depth estimation. A drop-channel layer with a drop rate of 0.2 is used after each module, and a dropout layer with a drop rate of 0.5 is used after the output of the network backbone. For our single-stage detector, we use two convolutional layers as our detection head. The number of channels of the first layer is 512; that of the second layer is determined by the numbers of classes $n_c$ and anchors $n_a$, where $n_c$ is set to 4 for the three object classes plus background, and $n_a$ is set to 36. Non-Maximum Suppression (NMS) with an IoU threshold of 0.4 is applied to the network output in 2D space. Since the regression of the 3D rotation is more difficult than that of the other parameters, a hill-climbing post-processing step is used to optimize it, as in [1]. The input images are scaled to a fixed resolution, and horizontal flipping is the only data augmentation. $n_s$ is set to 2 and the maximum dilation rate $d$ is set to 3 in all experiments.
The network is optimized by stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. We use a mini-batch size of 8 on 4 Nvidia Tesla V100 GPUs (16 GB). We use the 'poly' learning rate policy, setting the base learning rate to 0.01 and the power to 0.9. The number of iterations for training is set to 40,000.
4.2 Comparative Results
We conduct experiments on the official test set and the two validation splits of the KITTI dataset. Table 1 includes the top 14 monocular methods on the leaderboard, among which our method ranks first. We can observe that: (1) Our method outperforms the second-best competitor for monocular 3D car detection by a large margin (11.72 vs. 10.74, a relative improvement of 9.1%) under the moderate setting, which is the most important setting of KITTI. (2) Most competitors, such as [23, 33, 45, 37, 50, 1], utilize a detector (e.g. Faster R-CNN) pre-trained on COCO/KITTI or resort to multi-stage training to obtain better 2D detection and stable 3D results, while our model is trained end-to-end from the standard ImageNet pre-trained model. Nevertheless, we still achieve state-of-the-art 3D detection results, validating the effectiveness of our DLCN in learning 3D structure. (3) KITTI recently switched to $AP|_{R_{40}}$ from $AP|_{R_{11}}$; however, all existing methods report results under the old metric. We thus also give results under $AP|_{R_{11}}$ on the validation set for fair comparison. It can be seen that our method outperforms all others on the two splits for 3D car detection. Our results under $AP|_{R_{40}}$ on the validation set are shown in the ablation study.
Method  Task  

Easy  Moderate  Hard  Easy  Moderate  Hard  
3DNet  2D detection  93.42  85.16  68.14  94.13  84.45  65.73 
3D detection  17.94  14.61  12.74  16.72  12.13  09.46  
Bird’seye view  24.87  19.89  16.14  23.19  16.67  13.39  
+CL  2D detection  94.04  85.56  68.50  94.98  84.93  66.11 
3D detection  20.66  15.57  13.41  17.10  12.09  09.47  
Bird’seye view  29.03  23.82  19.41  24.12  17.75  13.66  
+DLCN  2D detection  92.98  85.35  68.63  93.81  86.71  70.19 
3D detection  23.25  17.92  15.58  18.32  13.50  10.61  
Bird’seye view  27.76  22.89  18.73  26.78  18.68  15.14  
+SP  2D detection  92.57  85.14  68.40  93.35  86.52  67.93 
3D detection  25.30  19.02  17.26  19.69  14.44  11.52  
Bird’seye view  31.39  24.40  19.85  26.91  20.07  15.77  
DLCN  2D detection  93.59  85.51  68.81  94.25  86.93  70.34 
3D detection  26.97  21.71  18.22  22.32  16.20  12.30  
Bird’seye view  34.82  25.83  23.53  31.53  22.58  17.87 
4.3 Detailed Analysis
4.3.1 Ablation Study
To conduct an ablation study of our model, we compare five versions of it: (1) 3DNet: the baseline model without our depth-guided filtering module; (2) +CL: the Corner Loss is added to 3DNet; (3) +DLCN: depth-guided depthwise local filtering is added; (4) +SP: the shift-pooling operator is added; (5) DLCN (our full model): adaptive dilation rates are added, as in Eq. 2. From Table 2, we can observe that: (1) The performance continuously increases as more components are used for 3D object detection, showing the contribution of each component. (2) Our depth-guided filtering module increases the 3D detection AP scores (moderate) from {15.57, 12.09} to {21.71, 16.20} w.r.t. the $AP|_{R_{11}}$ and $AP|_{R_{40}}$ metrics, respectively. This suggests that it is indeed effective at capturing the meaningful local structure for 3D object detection. (3) The main improvement comes from our adaptive dilated convolution (2.69 and 1.76 for $AP|_{R_{11}}$ and $AP|_{R_{40}}$, respectively), which allows each channel of the feature map to have a different receptive field and thus solves the scale-sensitive problem. Note that we tried different values of $n_s$ and found $n_s = 2$ to be the best.
4.3.2 Evaluation of Depth Maps
To study the impact of the accuracy of depth maps on the performance of our method, we extract depth maps using four different methods [13, 11, 35, 3] and then apply them to 3D detection. As reported in previous works on depth estimation, the three supervised methods (i.e. PSMNet, DispNet, and DORN) significantly outperform the unsupervised method [13]. Among the supervised methods, the stereo-based ones [3, 35] are better than the monocular-based DORN. Given these findings, we have the following observations from Table 3: (1) The accuracy of 3D detection is higher with a better depth map. This is because a better depth map provides better scene geometry and local structure. (2) As the quality of the depth map increases, the growth in detection accuracy becomes slower. (3) Even with the depth maps obtained by unsupervised learning [13], our method achieves state-of-the-art results. Compared to the pseudo-LiDAR based method [33], our method relies less on the quality of depth maps (19.63 vs. 15.45 using MonoDepth).
Depth  

Easy  Moderate  Hard  Easy  Moderate  Hard  
MonoDepth [13]  22.43  19.63  16.38  16.82  13.18  10.87 
DORN [11]  26.97  21.71  18.22  22.32  16.20  12.30 
DispNet [35]  30.95  24.06  20.29  25.73  18.56  15.10 
PSMNet [3]  30.03  25.41  21.63  25.24  19.80  16.45 
Conv module  

Easy  Moderate  Hard  Easy  Moderate  Hard  
Dynamic [20]  23.01  17.67  15.85  17.47  12.18  09.53 
Dynamic Local [20]  25.15  18.42  16.27  21.09  13.93  11.31 
Deformable [9]  23.98  18.24  16.11  19.05  13.42  10.07 
DLCN (ours)  26.97  21.71  18.22  22.32  16.20  12.30 
4.3.3 Evaluation of Convolutional Approaches
To show the effectiveness of our guided filtering module for 3D object detection, we compare it with several alternatives: Dynamic Convolution [20], Dynamic Local Filtering [20], and Deformable Convolution [9]. Our method belongs to the family of dynamic networks but incurs less computational cost and offers a stronger representation. For the first two methods, we conduct experiments using the same depth map as ours. For the third method, we apply deformable convolution to both the RGB and depth branches and merge them by element-wise product. From Table 4, we observe that our method performs the best. This indicates that our method better captures 3D information from RGB images, thanks to the special design of our DLCN.
Class  Easy  Moderate  Hard 

[split1/split2/test]  [split1/split2/test]  [split1/split2/test]  
Car  26.97/24.29/16.65  21.71/19.54/11.72  18.22/16.38/9.51 
Pedestrian  12.95/12.52/4.55  11.23/10.37/3.42  11.05/10.23/2.83 
Cyclist  5.85/7.05/2.45  4.41/6.54/1.67  4.14/6.54/1.36 
4.3.4 MultiClass 3D Detection
Since a person is a non-rigid body, its shape varies and its depth information is hard to estimate accurately. For this reason, 3D detection of pedestrians and cyclists is particularly difficult. Note that all pseudo-LiDAR based methods [33, 50, 48] fail to detect these two categories. However, as shown in Table 5, our method still achieves satisfactory performance on 3D detection of pedestrians and cyclists. Moreover, we show the activation maps corresponding to different filters of our DLCN in Figure 5. Different filters in the same layer of our model use receptive fields of different sizes to handle objects of different scales, including pedestrians (small) and cars (big), as well as distant cars (small) and nearby cars (big).
5 Conclusion
In this paper, we propose a Depth-guided Dynamic-Depthwise-Dilated Local ConvNet (DLCN) for monocular 3D object detection, where the convolutional kernels and their receptive fields (dilation rates) are different for different pixels and channels of different images. These kernels are generated dynamically, conditioned on the depth map, to compensate for the limitations of 2D convolution and to narrow the gap between 2D convolutions and point cloud-based 3D operators. As a result, our DLCN can not only address the problem of the scale-sensitive and meaningless local structure of 2D convolutions, but also benefit from the high-level semantic information of RGB images. Extensive experiments show that our DLCN better captures 3D information and ranked first for monocular 3D object detection on the KITTI dataset at the time of submission.
6 Acknowledgements
We would like to thank Dr. Guorun Yang for his careful proofreading. Ping Luo is partially supported by the HKU Seed Funding for Basic Research and SenseTime’s Donation for Basic Research. Zhiwu Lu is partially supported by National Natural Science Foundation of China (61976220, 61832017, and 61573363), and Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098).
APPENDIX
Appendix A Definition of 3D Corners
We define the eight corners of each ground truth box as follows:

    P_k = R(r_y) d_k + (x, y, z)^T,   d_k ∈ {±l/2} × {±h/2} × {±w/2},   k = 1, …, 8,   (8)

where the offsets d_k are enumerated in a defined order, (h, w, l) and (x, y, z) are the dimensions and center of the box, and R(r_y) is the egocentric rotation matrix around the camera y-axis. Note that we use the allocentric pose for regression.
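The corner computation in Eq. (8) can be sketched as follows. This is a minimal NumPy illustration; the particular sign ordering and the choice of the box center as origin are assumptions (KITTI labels place the origin at the bottom-face center instead, which shifts y by h/2):

```python
import numpy as np

def box_corners(h, w, l, x, y, z, r_y):
    """Eight corners of a 3D box in camera coordinates.

    (x, y, z) is taken here as the box center, and r_y is the
    egocentric yaw angle around the camera y-axis.
    """
    c, s = np.cos(r_y), np.sin(r_y)
    # Egocentric rotation matrix R(r_y) around the y-axis.
    R = np.array([[  c, 0.0,   s],
                  [0.0, 1.0, 0.0],
                  [ -s, 0.0,   c]])
    # Corner offsets d_k in one fixed enumeration order.
    xs = np.array([1,  1, -1, -1,  1,  1, -1, -1]) * (l / 2)
    ys = np.array([1,  1,  1,  1, -1, -1, -1, -1]) * (h / 2)
    zs = np.array([1, -1, -1,  1,  1, -1, -1,  1]) * (w / 2)
    corners = R @ np.vstack([xs, ys, zs])        # shape (3, 8)
    return corners + np.array([[x], [y], [z]])   # translate by the center
```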
Appendix B Comparisons between Two Rotation Definitions
As shown in Figure 6, while egocentric poses undergo viewpoint changes towards the camera when translated, allocentric poses always exhibit the same view, independent of the object's location. The allocentric pose α and the egocentric pose r_y can be converted to each other according to the viewing angle γ:

    α = r_y − γ,   γ = arctan(x / z).   (9)
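A minimal sketch of this conversion, assuming the KITTI camera convention (object center at lateral offset x and depth z; the function names are illustrative):

```python
import numpy as np

def egocentric_to_allocentric(r_y, x, z):
    """alpha = r_y - gamma, with viewing angle gamma = arctan(x / z)."""
    alpha = r_y - np.arctan2(x, z)
    # Wrap to [-pi, pi) for a unique representation.
    return (alpha + np.pi) % (2 * np.pi) - np.pi

def allocentric_to_egocentric(alpha, x, z):
    """Inverse conversion: r_y = alpha + gamma."""
    r_y = alpha + np.arctan2(x, z)
    return (r_y + np.pi) % (2 * np.pi) - np.pi
```

The two functions are exact inverses of each other up to angle wrapping, which is why either pose can be regressed and the other recovered at test time.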
Table 6: Comparison of labeling information and training strategies used in different monocular methods.

| Method | Depth | CAD | Points | Freespace | Segmentation | Pre-train/MST | End-to-end |
|---|---|---|---|---|---|---|---|
|  | ✓ |  |  |  |  |  |  |
| ROI-10D [34] | ✓ | ✓ |  |  |  |  |  |
| Multi-Level Fusion [54], D4LCN (Ours) | ✓ |  |  |  |  |  | ✓ |
|  | ✓ |  |  |  |  |  |  |
|  | ✓ | ✓ |  |  |  |  |  |
| Deep MANTA [2] |  | ✓ | ✓ |  |  |  |  |
| 3DVP [53] |  | ✓ |  |  | ✓ |  |  |
| Mono3D [4, 6] |  | ✓ |  | ✓ | ✓ | ✓ |  |
| Mono3D++ [17] | ✓ | ✓ | ✓ |  |  |  |  |
Table 7: Ablative results for convolutional methods.

| Conv Method | Dynamic | Local | Depthwise | Shift-pooling | Dilated | Easy | Moderate | Hard | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvNet |  |  |  |  |  | 20.66 | 15.57 | 13.41 | 17.10 | 12.09 | 9.47 |
| Depth-guided CN | ✓ |  |  |  |  | 23.01 | 17.67 | 15.85 | 17.47 | 12.18 | 9.53 |
| Depth-guided LCN | ✓ | ✓ |  |  |  | 25.15 | 18.42 | 16.27 | 21.09 | 13.93 | 11.31 |
| Depth-guided D-LCN | ✓ | ✓ | ✓ |  |  | 23.25 | 17.92 | 15.58 | 18.32 | 13.50 | 10.61 |
| Depth-guided SP-D-LCN | ✓ | ✓ | ✓ | ✓ |  | 25.30 | 19.02 | 17.26 | 19.69 | 14.44 | 11.52 |
| D4LCN (full model) | ✓ | ✓ | ✓ | ✓ | ✓ | 26.97 | 21.71 | 18.22 | 22.32 | 16.20 | 12.30 |
Appendix C Ablative Results for Convolutional Methods
The depth-guided filtering module in our D4LCN model can be decomposed into basic convolutional components:

Traditional Convolutional Network (ConvNet)
Depth-guided ConvNet (CN)
Depth-guided Local CN (LCN)
Depth-guided Depthwise LCN (D-LCN)
Depth-guided D-LCN with Shift-pooling (SP-D-LCN)
D4LCN (our full model)
The ablative results for these convolutional methods are shown in Table 7. We can observe that: (1) using the depth map to guide the convolution of each pixel brings a considerable improvement; (2) depthwise convolution with the shift-pooling operator not only has fewer parameters (Section 3.2 of our main paper) but also achieves better performance than standard convolution; (3) the main improvement comes from our adaptive dilated convolution, which allows each channel of the feature map to have a different receptive field.
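To make the first two components above concrete, the following NumPy sketch applies a per-pixel 3×3 kernel to a single-channel feature map, i.e. local (position-dependent) filtering. In the model these kernels come from the depth branch; here they are simply an input array, so the kernel generator is a hypothetical stand-in:

```python
import numpy as np

def depth_guided_local_conv(feat, depth_kernels, k=3):
    """Local (per-pixel) filtering: every output pixel uses its own
    k*k kernel, assumed to be produced from the depth map.

    feat:          (H, W) single-channel feature map
    depth_kernels: (H, W, k*k) kernel weights, one kernel per pixel
    """
    H, W = feat.shape
    pad = k // 2
    padded = np.pad(feat, pad)          # zero padding on both axes
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            # k*k neighborhood of pixel (i, j), flattened row-major.
            patch = padded[i:i + k, j:j + k].ravel()
            out[i, j] = patch @ depth_kernels[i, j]
    return out
```

With identical kernels at every pixel this reduces to an ordinary (correlation-form) convolution; the dynamic variant lets depth discontinuities modulate the filtering, which is exactly what the "Dynamic" and "Local" columns of Table 7 ablate.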
Appendix D Comparisons of Labeling Information and Training Strategies
We compare the labeling information and training strategies used in different monocular detection methods, as shown in Table 6.
It can be seen that: (1) our model outperforms all existing methods while using only the depth map estimated from the monocular image; (2) our model can be trained in an end-to-end manner.
Appendix E Distributions of Different Dilation Rates
We show the average ratio of channels using each dilation rate in the three blocks of our model over the validation set of split1 (Figure 7). It can be seen that: (1) for the first block, whose receptive field is insufficient, the model tends to enlarge the receptive field with large dilation rates, and it then uses small receptive fields in the second block; (2) in the third block, the model uses the three dilation rates evenly to handle object detection at different scales. We also show the activation maps corresponding to different filters of the third block of our D4LCN in our main paper (Figure 5).
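The per-channel choice among dilation rates can be illustrated with a minimal 1-D NumPy sketch, in which softmax gates (hypothetical stand-ins for the values the network learns per channel) blend responses computed at several dilation rates:

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """1-D correlation of signal x with a 3-tap kernel w at the given
    dilation rate (zero padding, 'same' output length)."""
    n = len(x)
    xp = np.pad(x, rate)
    # Taps at positions i - rate, i, i + rate of the original signal.
    return np.array([w[0] * xp[i] + w[1] * xp[i + rate] + w[2] * xp[i + 2 * rate]
                     for i in range(n)])

def adaptive_dilation_mix(x, w, gates, rates=(1, 2, 3)):
    """Blend responses at several dilation rates with softmax-normalized
    gates, mimicking a channel-wise adaptive receptive field."""
    g = np.exp(gates) / np.exp(gates).sum()
    return sum(gi * dilated_conv1d(x, w, r) for gi, r in zip(g, rates))
```

Averaging the softmax gates over a dataset, as done for Figure 7, reveals which receptive-field sizes each block prefers.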
References
 [1] (2019) M3D-RPN: monocular 3d region proposal network for object detection. In ICCV, pp. 9287–9296. Cited by: Table 6, §2, §2, §3.3.2, Table 1, §4.1, §4.2.
 [2] (2017) Deep MANTA: a coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In CVPR, pp. 2040–2049. Cited by: Table 6, §2.
 [3] (2018) Pyramid stereo matching network. In CVPR, pp. 5410–5418. Cited by: §4.3.2, Table 3.
 [4] (2016) Monocular 3d object detection for autonomous driving. In CVPR, pp. 2147–2156. Cited by: Table 6, §1, §2.
 [5] (2015) 3d object proposals for accurate object class detection. In NeurIPS, pp. 424–432. Cited by: §1, §4.1.
 [6] (2017) 3d object proposals using stereo imagery for accurate object class detection. TPAMI 40 (5), pp. 1259–1272. Cited by: Table 6.
 [7] (2017) Multiview 3d object detection network for autonomous driving. In CVPR, pp. 1907–1915. Cited by: §2.
 [8] (2019) Fast point rcnn. In ICCV, Cited by: §2.
 [9] (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §2, §4.3.3, Table 4.
 [10] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §3.1.
 [11] (2018) Deep ordinal regression network for monocular depth estimation. In CVPR, pp. 2002–2011. Cited by: Figure 1, §4.1, §4.3.2, Table 3.
 [12] (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, pp. 3354–3361. Cited by: §1, §4.1.
 [13] (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: Figure 1, §4.3.2, Table 3.
 [14] (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §2, §3.2.
 [15] (2010) Guided image filtering. In ECCV, pp. 1–14. Cited by: §1.
 [16] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.1.
 [17] (2019) Mono3d++: monocular 3d vehicle detection with twoscale 3d hypotheses and task priors. arXiv preprint arXiv:1901.03446. Cited by: Table 6, §1, §2.

 [18] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §3.2, §3.2.
 [19] (2019) Joint monocular 3d vehicle detection and tracking. In ICCV, pp. 5390–5399. Cited by: §1, §2.
 [20] (2016) Dynamic filter networks. In NeurIPS, pp. 667–675. Cited by: §1, §2, §3.2, §4.3.3, Table 4.
 [21] (2019) Monocular 3d object detection and box fitting trained endtoend using intersectionoverunion loss. arXiv preprint arXiv:1906.08070. Cited by: Table 6, §2, Table 1.
 [22] (2018) Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8. Cited by: §2.
 [23] (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In CVPR, pp. 11867–11876. Cited by: Table 6, §2, Table 1, §4.2.
 [24] (2018) 3D-RCNN: instance-level 3d object reconstruction via render-and-compare. In CVPR, pp. 3559–3568. Cited by: §2.
 [25] (2019) PointPillars: fast encoders for object detection from point clouds. In CVPR, pp. 12697–12705. Cited by: §2.
 [26] (2019) GS3D: an efficient 3d object detection framework for autonomous driving. In CVPR, pp. 1019–1028. Cited by: Table 6, §1, §2, Table 1.
 [27] (2019) Scale-aware trident networks for object detection. In ICCV, Cited by: §2.
 [28] (2019) Multi-task multi-sensor fusion for 3d object detection. In CVPR, pp. 7345–7353. Cited by: §2.
 [29] (2018) Deep continuous fusion for multi-sensor 3d object detection. In ECCV, pp. 641–656. Cited by: §2.
 [30] (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §3.3.4.
 [31] (2019) Deep fitting degree scoring network for monocular 3d object detection. In CVPR, pp. 1057–1066. Cited by: Table 6, §2, Table 1.
 [32] (2016) SSD: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §3.3.2, §3.3.
 [33] (2019) Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving. In ICCV, pp. 6851–6860. Cited by: Table 6, §1, §2, §4.2, §4.3.2, §4.3.4, Table 5.
 [34] (2019) ROI-10D: monocular lifting of 2d detection to 6d pose and metric shape. In CVPR, pp. 2069–2078. Cited by: Table 6, §2, §3.3.1, Table 1.
 [35] (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, pp. 4040–4048. Cited by: §4.3.2, Table 3.
 [36] (2017) 3d bounding box estimation using deep learning and geometry. In CVPR, pp. 7074–7082. Cited by: Table 6, §1, §2.
 [37] (2019) Shift R-CNN: deep monocular 3d object detection with closed-form geometric constraints. arXiv preprint arXiv:1905.09970. Cited by: Table 6, Table 1, §4.2.
 [38] (2018) Frustum PointNets for 3d object detection from RGB-D data. In CVPR, pp. 918–927. Cited by: §2.
 [39] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR, pp. 652–660. Cited by: §2.
 [40] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pp. 5099–5108. Cited by: §2.
 [41] (2019) Monogrnet: a geometric reasoning network for monocular 3d object localization. In AAAI, Vol. 33, pp. 8851–8858. Cited by: Table 6, §1, §2, Table 1.
 [42] (2016) You only look once: unified, real-time object detection. In CVPR, pp. 779–788. Cited by: §3.3.
 [43] (2018) Orthographic feature transform for monocular 3d object detection. arXiv preprint arXiv:1811.08188. Cited by: Table 6, §2, Table 1.
 [44] (2019) PointRCNN: 3d object proposal generation and detection from point cloud. In CVPR, pp. 770–779. Cited by: §2.
 [45] (2019) Disentangling monocular 3d object detection. arXiv preprint arXiv:1905.12365. Cited by: Table 6, Table 1, §4.1, §4.2.
 [46] (2019) Learning guided convolutional network for depth completion. arXiv preprint arXiv:1908.01238. Cited by: §2.
 [47] (2019) Voxelfpn: multiscale voxel feature aggregation in 3d object detection from point clouds. arXiv preprint arXiv:1907.05286. Cited by: §2.
 [48] (2019) Pseudo-LiDAR from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In CVPR, pp. 8445–8453. Cited by: Table 6, Figure 1, §1, §2, Table 1, §4.3.4, Table 5.
 [49] (2019) Frustum convnet: sliding frustums to aggregate local pointwise features for amodal 3d object detection. arXiv preprint arXiv:1903.01864. Cited by: §2.
 [50] (2019) Monocular 3d object detection with pseudo-LiDAR point cloud. arXiv preprint arXiv:1903.09847. Cited by: Table 6, §1, §2, Table 1, §4.2, §4.3.4, Table 5.
 [51] (2018) Shift: a zero flop, zero parameter alternative to spatial convolutions. In CVPR, pp. 9127–9135. Cited by: §3.2.
 [52] (2018) Dynamic filtering with large sampling field for convnets. In ECCV, pp. 185–200. Cited by: §2.
 [53] (2015) Data-driven 3d voxel patterns for object category recognition. In CVPR, pp. 1903–1911. Cited by: Table 6, §2, §4.1.
 [54] (2018) Multi-level fusion based 3d object detection from monocular images. In CVPR, pp. 2345–2353. Cited by: Table 6, §2.
 [55] (2018) SECOND: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §2.
 [56] (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §3.2.
 [57] (2014) Are cars just 3d boxes? Jointly estimating the 3d shape of multiple objects. In CVPR, pp. 3678–3685. Cited by: §2.
 [58] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR, pp. 6848–6856. Cited by: §3.2.
 [59] (2018) VoxelNet: end-to-end learning for point cloud based 3d object detection. In CVPR, pp. 4490–4499. Cited by: §2.