
Paint and Distill: Boosting 3D Object Detection with Semantic Passing Network

07/12/2022
by   Bo Ju, et al.
Baidu, Inc.

3D object detection from lidar or camera sensors is essential for autonomous driving. Pioneering attempts at multi-modality fusion complement the sparse lidar point clouds with rich semantic texture information from images at the cost of extra network designs and overhead. In this work, we propose a novel semantic passing framework, named SPNet, to boost the performance of existing lidar-based 3D detection models with the guidance of rich context painting, with no extra computation cost during inference. Our key design is to first exploit the potential instructive semantic knowledge within the ground-truth labels by training a semantic-painted teacher model and then guide the pure-lidar network to learn the semantic-painted representation via knowledge passing modules at different granularities: class-wise passing, pixel-wise passing and instance-wise passing. Experimental results show that the proposed SPNet can seamlessly cooperate with most existing 3D detection frameworks, bringing 1%∼5% AP gains on the KITTI test benchmark. Code is available at: https://github.com/jb892/SPNet.


1. Introduction

3D object detection aims to estimate informative 3D bounding boxes from lidar or camera sensors. It is a critical task and has received substantial attention for its wide applications in autonomous driving and robotics navigation. Recently emerged 3D object detection approaches mainly focus on learning discriminative 3D representations from the point clouds collected by LiDAR sensors, which provide accurate distance measurements and geometric cues for perceiving the surrounding environment. 3D point clouds are by nature sparse, unordered, and unevenly distributed. As a result, many approaches (Shi et al., 2020; Li et al., 2021a, b; Mao et al., 2021b) parse the sparse 3D point clouds into a compact representation and apply convolutional neural networks (CNNs) to learn structured features and explicitly explore 3D shape information. Thanks to the robust representation ability of CNNs, remarkable progress has been achieved on the existing mainstream benchmarks (Geiger et al., 2013; Caesar et al., 2020). However, LiDAR points lack rich texture attributes and fine-grained shape knowledge of the objects, which limits the generalization performance of pure-lidar-based methods. Take the left part of Figure 1 as an example: we find that the LiDAR-based approach PointPillars (Lang et al., 2019) performs well on easy and visible objects, while its localization capability deteriorates for far-away or small objects, causing undesired false positives.

To alleviate the issues caused by the lack of semantic texture information in point clouds, some approaches (Xu et al., 2018; Wang et al., 2021; Yin et al., 2021) explore the cross-modal fusion of RGB cameras and LiDAR sensors to exploit their complementarity and improve performance. Compared to point clouds, RGB images contain informative colour and texture information but fail to recover depth. The complementary nature of the two modalities makes them beneficial to each other and jointly leads to a win-win situation. Despite the promising results, these fusion-based methods suffer from three potential difficulties. First, LiDAR points and camera images have highly different data characteristics, making it non-trivial to combine features from these two misaligned views. Second, the more complex multi-modal networks require extra computation overhead compared to a single modality, which is undesirable for embedded systems. Third, the camera sensor is susceptible to external influences, such as weather and lighting conditions, which may introduce undesired noise that hurts the final localization accuracy. Given the problems mentioned above, we wonder whether there is a way to dig into the underlying semantic information in the ground-truth labels themselves and eliminate the dependence on RGB images.

Towards this goal, we design a paint-to-distill framework in this paper, which explores the inherent semantics of objects to boost the performance of existing 3D detectors without extra inference cost. Our work is inspired by observations on the early-fusion method PointPainting (Vora et al., 2020). It proposes a simple painting scheme that first assigns each point the semantic label predicted by a 2D semantic segmentation network on the image and then trains the whole network using the painted point clouds as inputs. Although notable improvement is achieved on the pedestrian and cyclist classes, it barely improves the performance on the commonly used car category. We claim that the unsatisfactory results are mainly bottlenecked by the imperfect segmentation outputs. Hence, we introduce GT-Painting, which paints the point clouds using the ground-truth bounding boxes and removes the dependence on RGB images. Since the ground-truth bounding boxes provide fully correct category labels, we can reach the potential performance upper bound of painting-based methods by leveraging ground-truth semantic information rather than predicted segmentation cues. Experimentally, we demonstrate that GT-Painting improves the performance on the hard examples shown in the right part of Figure 1. In addition, at the moderate difficulty level, it achieves an over 10% AP gain on the car category over the baseline model on the KITTI validation set.

Since the ground truth is not available during inference, we design a novel distillation framework, named Semantic Passing Network (SPNet), to alleviate the label dependence by distilling target-specific knowledge. Firstly, we use the GT-painted point clouds to train a more robust teacher model. Then we combine the ground-truth bounding boxes and the deep features at various granularities in the pre-trained teacher to guide the learning of the student, which only accepts original point clouds as inputs. Since the features of the teacher model contain rich semantic information, this additional guidance complements the default box supervision during training and is discarded during inference; together they contribute to the final performance. Specifically, we design three distillation schemes to pass semantic information between models: (i) a class-wise passing module that models the group information in the 3D representations and provides guidance on clustering; (ii) a pixel-wise passing module that aligns the feature maps in the BEV view; (iii) an instance-wise passing module that closes the gap in the spatial distribution of the model outputs. The benefit of these complementary information passing modules is that the teacher net and student net can be thoroughly aligned, which helps the student distill much richer semantic information from the strong teacher. To verify the effectiveness of the proposed model, we plug our SPNet into three existing 3D object detectors and achieve consistent performance improvements. Note that our best model achieves new state-of-the-art 3D object detection performance on the KITTI benchmark.

In conclusion, the major contributions are as follows:

  • We propose a novel plug-and-play framework named Semantic Passing Network (SPNet), which uses the semantic information of objects to boost the performance of existing 3D object detectors at negligible cost.

  • We design three complementary information passing modules: class-wise passing, pixel-wise passing and instance-wise passing, which provide auxiliary supervision signals that serve as strong guidance to transfer the rich semantics from the strong teacher to the student model.

  • We achieve new state-of-the-art 3D object detection performance on the KITTI benchmark. Besides, we plug our model into three existing detectors and achieve significant improvements, with 2.84%, 4.97% and 1.05% moderate AP increase on the KITTI benchmark.

2. Related Work

2.1. 3D Detection

3D object detectors can be roughly divided into three streams: point-based, voxel-based and point-voxel-based, depending on how they transform point clouds into 3D representations for localizing objects. First, point-based detectors (Qi et al., 2018; Yang et al., 2019; Shi et al., 2019; Shi and Rajkumar, 2020; Pan et al., 2021; Chen et al., 2022) operate directly on raw point clouds to generate 3D boxes. F-PointNet (Qi et al., 2018) is a pioneering work that utilizes frustums for region-level feature generation. PointRCNN (Shi et al., 2019) directly segments 3D point clouds to obtain foreground points and utilizes the segmentation features to refine the proposals. Pointformer (Pan et al., 2021) proposes a local-global transformer to build globally-adaptive point representations by integrating local features with global features. SASA (Chen et al., 2022) finds that the widely used set abstraction (SA) is inefficient for describing scenes in the context of detection and thus incorporates point-wise semantic cues to focus on more informative foreground points. Second, voxel-based detectors (Zhou and Tuzel, 2018; Zhu et al., 2020; Deng et al., 2020; Li et al., 2021a; Mao et al., 2021b) aim to transform the unstructured point clouds into regular grids over which conventional CNNs can be easily applied. VoxelNet (Zhou and Tuzel, 2018) encodes the points to form a compact feature representation and produces detection results by using region proposal networks. Voxel R-CNN (Deng et al., 2020) designs a new voxel RoI pooling operation to extract neighbouring voxel features, which are taken for further box refinement. VoTr (Mao et al., 2021b) further proposes a local self-attention module to replace the sparse convolution for voxel processing, which can be applied in most voxel-based detectors to boost the performance. Differently, CenterPoint (Yin et al., 2021) proposes an anchor-free 3D detection framework, which uses a keypoint detector to regress centers of objects and other attributes. To integrate the best of the two representations, a series of point-voxel-based methods (Shi et al., 2020; Li et al., 2021b; Sheng et al., 2021; Noh et al., 2021) have been proposed. The representative work PV-RCNN (Shi et al., 2020) utilizes a novel voxel set abstraction module to encode representative scene features while proposing RoI grid pooling to abstract proposal-specific features from the keypoints. These two innovative operations bring clear improvements. PV-RCNN++ (Shi et al., 2021) further replaces the RoI grid pooling with Vector-Pooling to efficiently collect keypoint features from different orientations, which improves the performance with less resource consumption and faster running speed. To tackle the shape-miss challenge caused by self-occlusion, BtcDet (Xu et al., 2021) learns object shape priors and estimates the complete shapes of objects that are partially occluded in the point clouds; exploiting the recovered shapes improves 3D detection performance.

Figure 2. The overview of our SPNet framework. It contains two branches, Student (top) and Teacher (bottom), with identical network structures. The student and teacher branches take the raw point cloud and the GT-painted point cloud as inputs, respectively. The painted semantic representations learned by the teacher network are passed to the student network at different granularity levels through the class-wise, pixel-wise and instance-wise passing modules. Note that the teacher model is discarded at inference.

2.2. Knowledge Distillation

Knowledge distillation has been extensively studied in recent years. The concept was popularized by Hinton et al. (Hinton et al., 2015), who transfer the generalization ability from a teacher network to a student network through soft targets. This procedure is realized by a distillation loss that considers both the ground truth and the prediction of the pre-trained teacher. Similar works can also be found in (Breiman and Shang, 1996; Fukuda et al., 2017), but they are designed mainly for classifiers. Since knowledge distillation can improve the performance of the student model while maintaining inference efficiency, it has been widely investigated in a variety of computer vision tasks, including object detection (Hao et al., 2020; Guo et al., 2021; Dai et al., 2021), semantic segmentation (Liu et al., 2019a; Jiao et al., 2019; Shu et al., 2020), object tracking (Liu et al., 2019b; Dunnhofer et al., 2021) and depth estimation (Hu et al., 2021; Chen et al., 2021). Due to the differences in backbones, it is not trivial to directly apply 2D methods to 3D tasks. As far as we know, only a few works (Wang et al., 2020; Yi et al., 2022) exploit knowledge distillation in LiDAR-based 3D object detection. Wang et al. (Wang et al., 2020) first train a teacher detection model on dense point clouds generated from multiple frames and then transfer the knowledge of the pre-trained teacher to a student network that accepts sparse point clouds as inputs. Wei et al. (Yi et al., 2022) propose LiDAR distillation to mitigate the domain shift induced by different LiDAR beam configurations. These methods use extra semantic information as guidance and mainly target the severe sparseness of point clouds. In contrast, this paper focuses on exploiting the rich semantic information in the ground-truth labels, which boosts performance while maintaining the same setting as mainstream 3D object detectors.

3. Methodology

This paper presents a novel Semantic Passing Network (SPNet) for 3D object detection, depicted in Figure 2. The core idea of this work is to make full use of the semantic information hidden in objects and map the semantics into auxiliary supervision signals to pass the instructive knowledge, which can boost the performance of existing 3D object detectors. Given original point clouds, we first use GT-painting to add the semantic class indicator on the point clouds and then use our SPNet to transfer the informative knowledge from the painted point clouds to the original point clouds.

In the following sections, we first introduce the GT-Painting method in Sec. 3.1. Then, we give a thorough introduction to the proposed SPNet architecture in Sec. 3.2. Finally, we explain the training losses of SPNet in Sec. 3.3.

3.1. GT-Painting

Although LiDAR can provide depth information with close-to-linear error for accurate localization, it is limited by the lack of texture information. To resolve this issue, PointPainting (Vora et al., 2020) proposes to leverage the rich semantics contained in RGB images to consolidate the point clouds. Specifically, it appends the per-pixel segmentation scores predicted on the RGB image to the last dimension of the corresponding points. As a result, LiDAR-based 3D detectors benefit from the camera modality's rich semantic cues. However, through our observation, the performance of such a painting method largely depends on the accuracy of the 2D semantic segmentation results. Experiments show that the painting method brings about 2% mAP improvement when using a strong 2D semantic segmentation model such as HRNet-OCR (Yuan et al., 2020), while the improvement vanishes with a weaker segmentation model such as HRNet (Sun et al., 2019). We therefore design GT-Painting to probe the performance upper bound of PointPainting by directly using the ground-truth boxes to provide the best possible semantic indicator for painting the point cloud. In particular, we define each raw input point cloud as $P = \{p_i\}_{i=1}^{N}$ and each point as $p_i = (x_i, y_i, z_i, r_i)$, where $(x_i, y_i, z_i)$ represents the spatial location of the point, $r_i$ is the LiDAR reflectance intensity, and $N$ is the total number of points. We generate the semantic class indicator by examining whether each point lies inside any of the ground-truth bounding boxes. Inner points of a specific class are marked with the corresponding numerical value and denoted as foreground; otherwise, the points are marked as background with the value zero. Finally, we concatenate the raw point cloud with the semantic indicator along the channel dimension to obtain the painted point cloud $\tilde{P}$. Feeding these augmented point clouds to the baseline model PointPillars (Lang et al., 2019) for training yields about 10% moderate AP increase on the KITTI validation set. Since the ground-truth bounding boxes are not available at inference time, we design SPNet to alleviate the label dependence while retaining the benefit of exploiting rich semantic information.
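The painting step itself reduces to a point-in-box test followed by a channel concatenation. Below is a minimal NumPy sketch of how such a GT-painting step could look; the box layout (center, size, yaw) and the class-id convention are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def gt_paint(points, gt_boxes, gt_classes):
    """Append a categorical semantic indicator to each LiDAR point.

    points:     (N, 4) array of (x, y, z, reflectance).
    gt_boxes:   (B, 7) array of (cx, cy, cz, dx, dy, dz, yaw)  -- assumed layout.
    gt_classes: (B,) integer class ids starting from 1 (0 means background).
    Returns:    (N, 5) painted points.
    """
    indicator = np.zeros(points.shape[0], dtype=points.dtype)
    xyz = points[:, :3]
    for box, cls_id in zip(gt_boxes, gt_classes):
        cx, cy, cz, dx, dy, dz, yaw = box
        # Rotate points into the box frame (rotation by -yaw around z).
        local = xyz - np.array([cx, cy, cz])
        c, s = np.cos(-yaw), np.sin(-yaw)
        x_loc = c * local[:, 0] - s * local[:, 1]
        y_loc = s * local[:, 0] + c * local[:, 1]
        z_loc = local[:, 2]
        inside = (np.abs(x_loc) <= dx / 2) & (np.abs(y_loc) <= dy / 2) & (np.abs(z_loc) <= dz / 2)
        indicator[inside] = cls_id            # foreground points get their class id
    return np.concatenate([points, indicator[:, None]], axis=1)
```

Training the teacher then amounts to feeding these 5-channel points to the otherwise unchanged detector.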

3.2. SPNet Architecture

Overview The overview of the proposed SPNet framework is shown in Figure 2. It follows a teacher-student paradigm, where the two models share the same structure. Since our network can be applied to most mainstream 3D detection models, we present our work using a general 3D detection framework rather than targeting a specific network. The model consists of three components: (i) a sparse convolution network (3D backbone) that transforms the input point clouds into 3D features characterizing local 3D shape information; (ii) an hourglass convolution network (BEV feature extractor) that flattens the 3D features into the 2D BEV view and extracts BEV representations rich in context information; (iii) a region proposal network (RPN) that fuses high-level abstract features to predict the category of objects and regress 3D bounding boxes simultaneously. In the first step, we use GT-Painting to decorate an input point cloud with ground-truth labels and train the teacher model on the painted point clouds. In the second step, the teacher model is initialized with the pre-trained weights obtained earlier, while the student is initialized with random parameters. Finally, we use the features of the teacher model as additional supervision signals, which, together with the ground-truth bounding boxes, supervise the student's learning during training. To promote sufficient delivery of semantic information between the two models, we design three complementary distillation modules: the class-wise passing module, the pixel-wise passing module and the instance-wise passing module. These modules are applied at different stages of the student model to align the two models at different granularities.
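In code, the two-branch setup boils down to instantiating the same detector class twice, loading and freezing the GT-painted teacher, and leaving only the student trainable. The sketch below assumes hypothetical `detector_cls` and `teacher_ckpt_path` handles rather than names from the paper or its released code.

```python
import torch

def build_teacher_student(detector_cls, teacher_ckpt_path):
    """Instantiate the frozen GT-painted teacher and the trainable student.

    `detector_cls` and `teacher_ckpt_path` are placeholders for whichever
    detector (e.g. a PointPillars implementation) and checkpoint are used.
    """
    teacher = detector_cls()
    teacher.load_state_dict(torch.load(teacher_ckpt_path))
    teacher.eval()                        # teacher only provides guidance
    for p in teacher.parameters():
        p.requires_grad_(False)           # no gradients flow into the teacher
    student = detector_cls()              # random init, consumes raw points only
    return teacher, student
```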

Class-wise Passing Module

3D detection aims to estimate the 3D bounding boxes of objects, whose essence is to perceive the clustering information of the point cloud and group points of the same type together. Such group-related information is often reflected in the feature distribution, so the difference in the clustering representation ability of different models is actually due to feature distribution variance. To eliminate this variation, a straightforward way is to directly add a penalty term between the features of the two models to reduce the distance in feature space. However, we claim that matching features without explicit modelling pushes the student model to mimic the teacher from a pixel-level perspective while losing global semantic information, which traps the student in a local minimum. In fact, for a certain class of objects, each point in the cluster has a strong semantic connection with the class center, which can be exploited to better reduce the feature distribution variation. Hence, we propose a class-wise passing module to transfer relational semantics from the teacher to the student model by modelling high-order statistics. Specifically, given the 3D features $F_{3D}$ taken from the 3D backbone, we first compress them along the gravitational dimension to get the BEV features $F \in \mathbb{R}^{C \times H \times W}$ for the student and $F^{T}$ for the teacher. The BEV perspective provides dense features with richer spatial information while preserving the inherent 3D structure, which allows us to better compute constraint relations on the feature maps. In order to distinguish the objects of different classes, we compute the binary foreground mask $M_k \in \{0, 1\}^{H \times W}$ for each class using the ground-truth bounding boxes, where $k$ indexes the object class. Next, we calculate the feature center of each class by masking the feature maps of that class and applying average pooling to aggregate the features along the spatial dimension:

(1)   $c_k = \dfrac{\sum_{i,j} M_k(i,j)\, F(:, i, j)}{\sum_{i,j} M_k(i,j)}$

Then the center vector $c_k$ is unsqueezed to the same resolution as the mask, represented as $\tilde{C}_k \in \mathbb{R}^{C \times H \times W}$. We combine the center map and the background using the mask as guidance to capture long-range global context information:

(2)   $G_k = M_k \odot \tilde{C}_k + (1 - M_k) \odot F$

In the same way, we can get the center vector $c_k^{T}$ and further obtain the global feature map $G_k^{T}$ for the teacher. To model the inner-class feature constraints, we compute the cluster similarity maps by establishing the cosine similarity between the global feature maps and the original BEV features for both the teacher and the student:

(3)   $S_k(i,j) = \dfrac{G_k(:, i, j) \cdot F(:, i, j)}{\lVert G_k(:, i, j) \rVert\, \lVert F(:, i, j) \rVert + \epsilon}$

where $\epsilon$ is set to 1e-6 to prevent zero division. Since the similarity maps contain rich semantics about the group relations, we calculate the class-wise passing loss between the cosine similarity maps of teacher and student:

(4)   $\mathcal{L}_{cw} = \dfrac{1}{K} \sum_{k=1}^{K} \big\lVert S_k^{T} - S_k \big\rVert_2^{2}$

where $K$ is the total number of categories used in training.
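The module can be sketched as follows in PyTorch. Since the exact form of Eqs. (1)-(4) is reconstructed from the prose, treat this as one plausible instantiation (masked average pooling, per-class global maps, cosine-similarity matching) rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def class_wise_passing_loss(feat_s, feat_t, class_masks, eps=1e-6):
    """Class-wise passing loss between student and teacher BEV features.

    feat_s, feat_t: (C, H, W) BEV features of the student / teacher branch.
    class_masks:    (K, H, W) binary foreground masks, one per category,
                    rasterized from the ground-truth boxes.
    """
    loss = feat_s.new_zeros(())
    for mask in class_masks:                         # loop over the K categories
        m = mask.float()[None]                       # (1, H, W)
        # Masked average pooling -> per-class feature center (Eq. (1) style).
        denom = m.sum().clamp(min=1.0)
        c_s = (feat_s * m).sum(dim=(1, 2), keepdim=True) / denom   # (C, 1, 1)
        c_t = (feat_t * m).sum(dim=(1, 2), keepdim=True) / denom
        # Global map: class center inside the mask, raw features outside (Eq. (2) style).
        g_s = m * c_s + (1.0 - m) * feat_s
        g_t = m * c_t + (1.0 - m) * feat_t
        # Per-pixel cosine similarity against the BEV features (Eq. (3) style).
        sim_s = F.cosine_similarity(g_s, feat_s, dim=0, eps=eps)    # (H, W)
        sim_t = F.cosine_similarity(g_t, feat_t, dim=0, eps=eps)
        # Match the student's similarity map to the teacher's (Eq. (4) style).
        loss = loss + F.mse_loss(sim_s, sim_t)
    return loss / class_masks.shape[0]
```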

Pixel-wise Passing Module Spatial information is a key factor for accurate localization. Thus, we design a pixel-level alignment method, named the pixel-wise passing module, to align structured information across spatial locations between the teacher and student models. Specifically, we take the BEV feature maps $B, B^{T} \in \mathbb{R}^{C_b \times H \times W}$ produced by the compact BEV feature extractor as input, where $C_b$ is the channel dimension. By minimizing the L2 distance between the two feature maps, the high-level instructive information contained in the deep features flows from teacher to student spatially. However, the background regions often contribute a significant amount to the feature loss, which is unnecessary and makes the pixel-wise passing module less effective. To mitigate the effect of such noise, we produce the foreground binary feature mask $M_{fg}$ using the labels, which makes the loss calculation focus only on the foreground area. The loss function can be formulated as:

(5)   $\mathcal{L}_{pw} = \dfrac{1}{\sum_{i,j} M_{fg}(i,j)} \sum_{i,j} M_{fg}(i,j)\, \big\lVert B^{T}(:, i, j) - B(:, i, j) \big\rVert_2^{2}$
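A compact sketch of this foreground-masked L2 loss follows, assuming the mask is rasterized from the ground-truth boxes at BEV resolution; the normalization choice is an assumption of the sketch.

```python
import torch

def pixel_wise_passing_loss(bev_s, bev_t, fg_mask, eps=1e-6):
    """Foreground-masked L2 alignment of student / teacher BEV feature maps.

    bev_s, bev_t: (C, H, W) BEV features from the BEV feature extractor.
    fg_mask:      (H, W) binary foreground mask built from the GT boxes.
    """
    m = fg_mask.float()[None]                 # (1, H, W)
    sq_diff = (bev_s - bev_t) ** 2 * m        # zero out background positions
    return sq_diff.sum() / (m.sum() * bev_s.shape[0] + eps)
```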

Instance-wise Passing Module Prediction differences carry vital information on why the student lags behind the teacher, which further complements the guidance of class-wise and pixel-wise imitation. Many pioneering works (Wang et al., 2019; Guo et al., 2021; Zhou et al., 2021) have shown that labels generated from a better domain can provide additional fine-grained supervision beyond the hard labels alone. Hence, we design an instance-wise passing module that increases the consistency between the teacher and student branches at the prediction level. The input of the instance-wise passing module is the class prediction score maps $P, P^{T} \in \mathbb{R}^{A \times H \times W}$ of the student and teacher models, where $A$ is the number of anchors at each prediction grid. First, we adopt a masked Kullback-Leibler (KL) divergence loss to compute the distance between the teacher and student predictions for both foreground and background. Then, we re-weight the foreground and background losses using the empirical values $\alpha$ and $\beta$, respectively. Finally, we add the two losses to get the final loss of the instance-wise passing module:

(6)   $\mathcal{L}_{iw} = \alpha\, \mathcal{L}_{KL}(P^{T}, P; M_{fg}) + \beta\, \mathcal{L}_{KL}(P^{T}, P; M_{bg})$

where $M_{bg}$ and $M_{fg}$ represent the masks of the background and foreground, respectively. The masked KL divergence loss can be described as:

(7)   $\mathcal{L}_{KL}(P^{T}, P; M) = \dfrac{1}{\sum_{i,j} M(i,j)} \sum_{i,j} M(i,j) \sum_{a} P^{T}_{a}(i,j) \log \dfrac{P^{T}_{a}(i,j)}{P_{a}(i,j)}$
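One possible realization of the masked KL terms is sketched below; treating each anchor score as a Bernoulli distribution under a sigmoid is an assumption of this sketch, not a detail stated in the paper.

```python
import torch

def masked_kl(logits_t, logits_s, mask, eps=1e-6):
    """KL(teacher || student) over per-anchor class scores, averaged over the
    locations selected by `mask`.

    logits_t, logits_s: (A, H, W) anchor classification logits.
    mask:               (H, W) binary mask (1 = keep this location).
    """
    m = mask.float()[None]                    # (1, H, W), broadcast over anchors
    p_t = torch.sigmoid(logits_t)
    p_s = torch.sigmoid(logits_s)
    kl = p_t * torch.log((p_t + eps) / (p_s + eps)) \
        + (1 - p_t) * torch.log((1 - p_t + eps) / (1 - p_s + eps))
    return (kl * m).sum() / (m.sum() * logits_t.shape[0] + eps)

def instance_wise_passing_loss(logits_t, logits_s, fg_mask, alpha=2.0, beta=0.1):
    """Weighted sum of foreground / background masked KL terms; alpha=2 and
    beta=0.1 follow the values reported in the implementation details."""
    bg_mask = 1.0 - fg_mask.float()
    return alpha * masked_kl(logits_t, logits_s, fg_mask) \
        + beta * masked_kl(logits_t, logits_s, bg_mask)
```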

3.3. Overall loss function

We strictly follow the detection losses used in the adopted baselines. Taking PointPillars (Lang et al., 2019) as a representative, we use a focal loss for object classification, a softmax classification loss on the discretized directions and a smooth-L1 loss for the localization regression residuals between ground truth and anchors. More details can be found in the supplementary material. The total 3D detection loss is therefore:

(8)   $\mathcal{L}_{det} = \beta_1 \mathcal{L}_{cls} + \beta_2 \mathcal{L}_{dir} + \beta_3 \mathcal{L}_{loc}$

Besides, we combine the distillation losses from the three information passing modules with the 3D detection loss to optimize the student model jointly. The total loss can be defined as:

(9)   $\mathcal{L}_{total} = \mathcal{L}_{det} + \lambda_1 \mathcal{L}_{cw} + \lambda_2 \mathcal{L}_{pw} + \lambda_3 \mathcal{L}_{iw}$
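The combination in Eq. (9) is then a simple weighted sum. The default weights in the sketch below follow the SPNet-P values reported in Sec. 4.1; the mapping of each lambda to its module follows the order in which they are listed there.

```python
def spnet_total_loss(det_loss, cw_loss, pw_loss, iw_loss,
                     lambda_cw=0.1, lambda_pw=10.0, lambda_iw=10.0):
    """Total student objective: detection loss plus the three passing losses."""
    return det_loss + lambda_cw * cw_loss + lambda_pw * pw_loss + lambda_iw * iw_loss
```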

4. Experiments

4.1. Setup

Dataset We evaluate our method on the challenging KITTI dataset (Geiger et al., 2013), which consists of 7481 training samples and 7518 testing samples collected from autonomous driving scenes. Each sample provides both the point cloud and the camera image, while our method only uses the point cloud as input. Following the frequently used train/val split of previous work (Chen et al., 2017), we divide the training samples into a train split of 3712 samples and a val split of 3769 samples. We conduct ablation studies on this split and report results on the KITTI test benchmark with models trained on all 7481 training samples.

Metrics The KITTI dataset evaluates the average precision (AP) of both bird's eye view (BEV) detection and 3D object detection. The labels are divided into three difficulty levels, easy, moderate and hard, on the basis of object size, occlusion and truncation levels. We report the 40-recall-position metric (R40) on the official KITTI test server and the 11-recall-position metric (R11) on the validation set for a fair comparison with previous works. We mainly focus on the Car category while also presenting Pedestrian and Cyclist performance. The rotated IoU thresholds for the three categories are 0.7, 0.7 and 0.5, respectively.
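For reference, R11 and R40 only differ in the set of recall thresholds at which precision is interpolated. The following simplified sketch illustrates the interpolation scheme; the official KITTI evaluation additionally handles difficulty filtering and don't-care regions, which are omitted here.

```python
import numpy as np

def average_precision(recall, precision, num_points=40):
    """Interpolated AP at fixed recall positions (KITTI style).

    recall, precision: 1D arrays of recall/precision pairs sorted by descending score.
    num_points=40 uses thresholds 1/40 ... 1 (R40); num_points=11 uses 0, 0.1, ..., 1 (R11).
    """
    if num_points == 11:
        thresholds = np.linspace(0.0, 1.0, 11)
    else:
        thresholds = np.linspace(1.0 / num_points, 1.0, num_points)
    ap = 0.0
    for t in thresholds:
        mask = recall >= t
        # Interpolated precision: the maximum precision at recall >= t.
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / len(thresholds)
    return ap
```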

Method Reference 3D Detection (%)
Mod. Easy Hard
LiDAR + RGB:
MV3D (Chen et al., 2017) CVPR 2017 63.63 74.97 54.00
F-PointNet (Qi et al., 2018) CVPR 2018 69.79 82.19 60.59
Painting (Vora et al., 2020) CVPR 2020 71.70 82.11 67.08
F-ConvNet(Wang and Jia, 2019) IROS 2019 76.39 87.36 66.69
MMF (Liang et al., 2019) CVPR 2019 77.43 88.40 70.22
3D-CVF (Yoo et al., 2020) ECCV 2020 80.05 89.20 73.11
CLOCs (Pang et al., 2020) IROS 2020 80.67 88.94 77.15
LiDAR only:
VoxelNet (Zhou and Tuzel, 2018) CVPR 2018 64.17 77.82 57.51
SECOND (Yan et al., 2018) Sensors 2018 72.55 83.34 65.82
PointRCNN (Shi et al., 2019) CVPR 2019 75.64 86.96 70.70
Part-A2 (Shi et al., 2020) PAMI 2020 78.49 87.81 73.51
3DSSD (Yang et al., 2020) CVPR 2020 79.57 88.36 74.55
STD (Yang et al., 2019) ICCV 2019 79.71 87.95 75.09
IA-SSD (Zhang et al., 2022) CVPR 2022 80.32 88.87 75.10
PV-RCNN (Shi et al., 2020) CVPR 2020 81.43 90.25 76.82
Voxel-Point (Li et al., 2021b) MM 2021 81.58 88.53 77.37
Voxel-RCNN (Deng et al., 2020) AAAI 2021 81.62 90.90 77.06
CT3D (Sheng et al., 2021) ICCV 2021 81.77 87.83 77.16
VoxSeT (Chenhang He and Zhang, 2022) CVPR 2022 82.06 88.53 77.46
PointPillars (Lang et al., 2019) CVPR 2019 74.99 79.05 68.30
SPNet-P - 77.83 85.84 72.84
Improvement - +2.84 +6.79 +4.54
*CenterPoint (Yin et al., 2021) CVPR 2021 73.96 81.17 69.48
SPNet-V - 78.93 86.87 73.64
Improvement - +4.97 +5.70 +4.16
*PV-RCNN++ (Shi et al., 2021) arXiv 2021 81.06 86.95 76.60
SPNet-PV - 82.11 88.53 77.41
Improvement - +1.05 +1.58 +0.81
Table 1. Performance comparison on the KITTI test benchmark for the car category with 40 recall points. * means results are reproduced by the public OpenPCDet (Team and others, 2020).

Implementation Details Our plug-and-play SPNet has three versions: (a) SPNet-P is built upon the pillar-based and anchor-based method PointPillars (Lang et al., 2019). (b) SPNet-V is built upon the voxel-based and anchor-free method CenterPoint (Yin et al., 2021). (c) SPNet-PV is built upon the point-voxel-based method PV-RCNN++ (Shi et al., 2021), which leads the KITTI benchmark. Since our experimental settings are strictly consistent with the adopted baselines, we only introduce the details of SPNet-P here. In the distillation stage, the loss weights $\lambda_1$, $\lambda_2$ and $\lambda_3$ for class-wise, pixel-wise and instance-wise passing are set to 0.1, 10 and 10, respectively. In the instance-wise passing module, $\alpha$ and $\beta$ are set to 2 and 0.1, respectively. In the 3D detection loss, $\beta_1$, $\beta_2$ and $\beta_3$ are set to 1, 0.2 and 2, respectively. We train the student network from scratch in an end-to-end manner with the ADAM optimizer for 80 epochs on 4 V100 GPUs. The learning rate is set to 0.003 with a poly schedule of power 0.9 and a weight decay of 0.9. The batch size per GPU is set to 4. For the KITTI dataset, the detection ranges along the x, y and z axes are [0, 69.12], [-39.68, 39.68] and [-3, 1] meters, respectively, and the space is voxelized with a voxel size of (0.16 m, 0.16 m, 4 m). During training, we adopt the commonly used data augmentation strategies for 3D object detection, including random world flip along the X axis, random world rotation around the Z axis by a randomly sampled angle, and random world scaling with a ratio sampled from [0.95, 1.05]. Besides, we apply the ground-truth sampling augmentation used in SECOND (Yan et al., 2018), which crops boxes and the corresponding points from other frames and pastes them into the training scenes.
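For convenience, the SPNet-P hyper-parameters listed above can be collected in a single configuration. The key names below are illustrative and do not follow OpenPCDet's actual schema; the mapping of the three detection-loss weights to cls/dir/loc follows the order in which they are listed above.

```python
# Hypothetical config mirroring the SPNet-P settings reported in the text.
SPNET_P_CONFIG = {
    "point_cloud_range": [0.0, -39.68, -3.0, 69.12, 39.68, 1.0],   # x, y, z min/max (m)
    "voxel_size": [0.16, 0.16, 4.0],                               # pillar size (m)
    "optimizer": {"type": "adam", "lr": 0.003, "epochs": 80, "batch_size_per_gpu": 4},
    "distill_loss_weights": {"class_wise": 0.1, "pixel_wise": 10.0, "instance_wise": 10.0},
    "instance_wise": {"alpha_fg": 2.0, "beta_bg": 0.1},
    "det_loss_weights": {"cls": 1.0, "dir": 0.2, "loc": 2.0},
    "augmentation": ["random_world_flip_x", "random_world_rotation_z",
                     "random_world_scaling_0.95_1.05", "gt_sampling"],
}
```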

Method Reference 3D Detection(%)
Mod. Easy Hard
LiDAR + RGB:
MV3D (Chen et al., 2017) CVPR 2017 62.28 71.29 56.56
F-PointNet (Qi et al., 2018) CVPR 2018 70.92 83.76 63.65
3D-CVF (Yoo et al., 2020) ECCV 2020 79.88 89.67 78.47
LiDAR only:
SECOND (Yan et al., 2018) Sensors 2018 76.48 87.43 69.10
TANet (Liu et al., 2020) AAAI 2020 76.64 87.52 73.86
PointRCNN (Shi et al., 2019) CVPR 2019 78.63 88.88 77.38
3DSSD (Yang et al., 2020) CVPR 2020 79.45 89.71 78.67
IA-SSD (Zhang et al., 2022) CVPR 2022 79.57 - -
PV-RCNN (Shi et al., 2020) CVPR 2020 83.69 89.35 78.70
VoTR-TSD (Mao et al., 2021b) ICCV 2021 84.04 89.04 78.68
Pyramid-PV (Mao et al., 2021a) ICCV 2021 84.38 89.37 78.84
Voxel-RCNN (Deng et al., 2020) AAAI 2021 84.52 89.41 78.93
PointPillars (Lang et al., 2019) CVPR 2019 77.31 87.29 75.55
SPNet-P - 78.67 88.71 77.29
Improvement - +1.36 +1.42 +1.74
*CenterPoint (Yin et al., 2021) CVPR 2021 76.98 86.76 74.54
SPNet-V - 78.32 87.88 77.24
Improvement - +1.34 +1.12 +2.70
*PV-RCNN++ (Shi et al., 2021) arXiv 2021 83.84 89.34 78.81
SPNet-PV - 84.92 89.26 78.92
Improvement - +1.08 -0.08 +0.11
Table 2. Performance comparison on the KITTI validation set for the car category with 11 recall points. * means results are reproduced by the public OpenPCDet (Team and others, 2020).
Method Pedestrian Cyclist
Mod. Easy Hard Mod. Easy Hard
SECOND (Yan et al., 2018) 51.84 57.02 47.38 65.21 82.00 61.35
PointRCNN (Shi et al., 2019) 58.32 63.29 51.59 66.67 83.68 61.92
VoxelNet (Zhou and Tuzel, 2018) 59.84 67.79 54.38 64.89 84.92 58.59
PointPillars (Lang et al., 2019) 54.98 59.42 49.86 65.19 81.73 62.20
SPNet-P 56.46 61.07 51.19 68.03 82.10 63.00
Improvement +1.48 +1.65 +1.33 +2.84 +0.37 +0.80
Table 3. Performance comparison on the KITTI validation set for the pedestrian and cyclist category with 11 recall points. The results are evaluated by average precision at IoU = 0.5.
Group Class-wise Feature-wise Prediction-wise 3D Detection (%)
Mod. Easy Hard
I - - - 77.31 87.29 75.55
II ✓ - - 78.23 88.55 76.82
III - ✓ - 78.47 88.29 77.12
IV - - ✓ 77.74 86.99 76.47
V ✓ ✓ ✓ 78.67 88.71 77.29
Table 4. Effectiveness of each individual component in the SPNet for the car class at IOU=0.7 (R11).

4.2. Comparison to State-of-the-Art

Results on the KITTI test set We first compare our SPNet with other state-of-the-art approaches on the KITTI test benchmark for the commonly used car category with the default IoU of 0.7. As shown in Table 1, our three versions SPNet-P, SPNet-V and SPNet-PV consistently outperform their baselines by large margins of 2.84%, 4.97% and 1.05% moderate AP, respectively. Note that our SPNet-PV achieves the new state-of-the-art performance among all competitors with a moderate AP of 82.11%. Specifically, compared with the strong competitor CLOCs (Pang et al., 2020), which combines RGB images and point clouds to boost the performance, our SPNet-PV achieves a significant improvement of 1.44% moderate AP. It is worth noting that CLOCs (Pang et al., 2020) builds its best model on the PV-RCNN (Shi et al., 2020) baseline but still obtains inferior results to ours. Even compared with the very recent method VoxSeT (Chenhang He and Zhang, 2022), which leads the KITTI benchmark, our method still achieves better performance. Since the proposed SPNet needs no extra computational cost at inference time, all SPNet variants achieve the same inference speed as their baseline models.

Results on the KITTI val set We evaluate the three versions of the proposed framework against several state-of-the-art methods on the KITTI validation set. As shown in Table 2, the results are calculated with 11 recall positions at an IoU threshold of 0.7 for a fair comparison. Our SPNet-P, SPNet-V and SPNet-PV improve their baselines by a large margin, with 1.36%/1.34%/1.08% gains on the moderate level and 1.74%/2.70%/0.11% on the hard level. The significant performance gains on the hard samples illustrate the importance of exploiting context information for detecting 3D objects with only a few points. Our SPNet-PV achieves the best performance among all competitors, pushing the moderate car AP to 84.92%. We also observe that the AP gains on the KITTI test set are more significant than on the validation set. We argue that this results from the smaller domain gap between the train and validation sets, which further illustrates the good generalization of our method.

Results on “Pedestrian” and “Cyclist” We also report experimental results on small objects such as pedestrians and cyclists in Table 3. Following previous works (Shi et al., 2019; Zhou and Tuzel, 2018), the 3D detection results are evaluated at IoU = 0.5 with 11 recall points. We adopt PointPillars (Lang et al., 2019) as the baseline and improve the moderate detection accuracy of pedestrians and cyclists from 54.98%/65.19% to 56.46%/68.03% AP, respectively. Especially for the cyclist class, our SPNet-P achieves significantly better performance than all other methods, with up to 1.36% and 1.08% 3D AP improvement on the moderate and hard levels.

4.3. Ablation Study

We conduct extensive ablation studies to understand the roles of different components in SPNet. All experiments in this section are conducted on SPNet-P, which uses PointPillars (Lang et al., 2019) as the baseline. We mainly focus on the car category and use IoU = 0.7 with 11 recall points as the evaluation metric.

The effects of different components SPNet consists of three standalone information passing modules: class-wise passing, pixel-wise passing and instance-wise passing. From the results shown in Table 4, we find that adding any single information passing module improves the final performance, by 0.92%/1.16%/0.43% on the moderate setting, respectively. Adding the three components together further boosts performance, showing that our modules complement each other.

Method 3D Detection(%) BEV Detection(%)
Mod. Easy Hard Mod. Easy Hard
Baseline 77.31 87.29 75.55 87.60 89.82 85.71
Direct transfer 78.38 88.48 77.08 88.07 90.38 86.04
Affine map 78.57 88.80 77.22 88.25 90.23 86.64
Class cluster 78.67 88.71 77.29 88.29 90.42 86.41
Table 5. Evaluation on the loss design of class-wise passing module. The results are for the car class at IOU=0.7 (R11).
Method 3D Detection(%) BEV Detection(%)
Mod. Easy Hard Mod. Easy Hard
Baseline 77.31 87.29 75.55 87.60 89.82 85.71
L1 loss 78.54 88.53 77.15 88.24 90.36 86.73
KLD loss 78.64 88.74 77.25 88.26 90.37 86.73
L2 loss 78.67 88.71 77.29 88.29 90.42 86.41
Table 6. Evaluation on the loss design of pixel-wise passing module. The results are for the car class at IOU=0.7 (R11).
Method 3D Detection(%) BEV Detection(%)
Mod. Easy Hard Mod. Easy Hard
Baseline 77.31 87.29 75.55 87.60 89.82 85.71
L1 loss 78.67 88.63 77.22 88.23 90.41 86.95
L2 loss 78.58 88.04 77.20 88.26 90.19 86.88
KLD loss 78.67 88.71 77.29 88.29 90.42 86.41
Table 7. Evaluation on the loss design of instance-wise passing module. The results are for the car class at IOU=0.7 (R11).
Figure 3. Visualization of the 3D detection results and the corresponding 2D feature maps at the prediction level on the KITTI validation split. The predicted 3D boxes of the baseline, teacher and student models are shown in red, green and orange, respectively, while the ground-truth boxes are in blue. The blue dashed boxes highlight regions that produce fewer false positives and more accurate box predictions after applying our SPNet during training.

Class-wise passing module We investigate different distillation strategies for the class-wise passing module in Table 5. We define several variants based on different styles of map generation: (a) Direct transfer: directly computing the L2 loss between the feature maps; (b) Affine map: computing the affinity maps of the teacher and student feature maps and calculating the L2 loss between the affinity maps; (c) Class cluster: our design. Experimental results show that Direct transfer improves detection performance over the baseline, with 1.24%/1.23%/1.60% improvement in 3D detection, while Affine map brings a 1.26% improvement over the baseline on the moderate level. Our Class cluster achieves the best results on both 3D and BEV detection, which demonstrates the effectiveness of the module.

Pixel-wise passing module Table 6 illustrates the impact of different losses used in the pixel-wise passing module on the performance of the proposed model. We find that utilizing L1 loss, KLD loss and L2 loss can improve the baseline from 77.31% to 78.54%, 78.64% and 78.67%, respectively. Three losses achieve comparable results, and we choose the best L2 loss as the distillation loss for the pixel-wise passing module.

Instance-wise passing module Table 7 illustrates the impact of different losses used in the instance-wise passing module on the performance of the proposed model. Similar to the pixel-wise passing module, adding any one of the L1, L2 or KLD losses boosts the performance of the baseline, and the improvements are comparable. However, we find that the KLD loss achieves the best performance here. We attribute this to the gap between intermediate features and final outputs: the Kullback-Leibler divergence is better suited to measuring the distance between probability distributions over output instances. Therefore we choose the KLD loss to guide the learning of the instance-wise passing module.

Method 3D Detection(%) BEV Detection(%)
Mod. Easy Hard Mod. Easy Hard
Baseline 77.31 87.29 75.55 87.60 89.82 85.71
One-hot Encoding 87.13 89.94 80.04 90.21 90.74 90.16
Categorical Encoding 87.42 89.76 79.98 90.19 90.62 90.20
Table 8. Evaluation on the encoding method design in the GT-Painting. The results are for the car class at IOU=0.7 (R11).

Encoding method for GT-Painting Table 8 illustrates the impact of using different encoding methods for GT-Painting. We define two variants: (a) One-hot Encoding: the class information is encoded into a one-hot vector whose length equals the number of classes; for example, a point belonging to a car is appended with the one-hot vector of the car class. (b) Categorical Encoding: we use a single scalar to indicate which category the point belongs to. The experiments show that both methods improve the baseline by a large margin with comparable gains, which demonstrates the importance of class-wise information and shows that its contribution is not limited by the representation form. To reduce the computational complexity of the network, we choose Categorical Encoding in this paper.
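The two indicator variants can be illustrated with a few lines of code; the class list below is an assumption for illustration only.

```python
import numpy as np

CLASSES = ["car", "pedestrian", "cyclist"]   # illustrative class list

def one_hot_indicator(class_name):
    """One-hot encoding: one channel per class; an all-zero vector means background."""
    vec = np.zeros(len(CLASSES), dtype=np.float32)
    if class_name in CLASSES:
        vec[CLASSES.index(class_name)] = 1.0
    return vec                                # appended as len(CLASSES) extra channels

def categorical_indicator(class_name):
    """Categorical encoding: a single scalar id (0 = background), one extra channel."""
    return float(CLASSES.index(class_name) + 1) if class_name in CLASSES else 0.0
```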

Visualization Qualitative results are provided in Figure 3. We visualize the 3D object detection results and the corresponding feature maps of three models: the baseline, the teacher and the student, where PointPillars (Lang et al., 2019) serves as the baseline. Since feature maps close to the prediction space better reflect the discrepancy between different models, we only present the feature maps taken from the instance-wise passing module. The baseline clearly produces many false predictions, which illustrates that merely applying supervision in the label space cannot fully exploit the semantic information contained in the objects. In contrast, the teacher model takes advantage of the input semantics and produces more accurate results, and then passes the semantic information to the student, which in turn achieves better results. The differences in the feature maps further reflect this effect and illustrate the necessity of our semantic transfer.

5. Conclusion

In this paper, we propose a novel Semantic Passing Network (SPNet) for 3D object detection. Our method takes full advantage of the semantic information in the point cloud labels and distills the instructive knowledge to the student network during training. Benefiting from this design, we can improve existing 3D detectors by a large margin without any extra inference cost. The proposed SPNet achieves state-of-the-art 3D detection performance on the KITTI benchmark, which demonstrates the effectiveness of the proposed method.

Limitation So far, our work only considers distillation in one-stage networks or in the first stage of two-stage networks, ignoring the second refinement stage. For future work, we plan to explore how designing distillation losses for the second stage affects the final performance. Besides, we will apply our SPNet to more 3D object detectors to further boost localization performance.

References

  • L. Breiman and N. Shang (1996) Born again trees. University of California, Berkeley, Berkeley, CA, Technical Report 1 (2), pp. 4. Cited by: §2.2.
  • H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631. Cited by: §1.
  • C. Chen, Z. Chen, J. Zhang, and D. Tao (2022) SASA: semantics-augmented set abstraction for point-based 3d object detection. arXiv preprint arXiv:2201.01976. Cited by: §2.1.
  • X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1907–1915. Cited by: §4.1, Table 1, Table 2.
  • Z. Chen, X. Ye, W. Yang, Z. Xu, X. Tan, Z. Zou, E. Ding, X. Zhang, and L. Huang (2021) Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15529–15538. Cited by: §2.2.
  • S. L. Chenhang He and L. Zhang (2022) Voxel set transformer: a set-to-set approach to 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2, Table 1.
  • X. Dai, Z. Jiang, Z. Wu, Y. Bao, Z. Wang, S. Liu, and E. Zhou (2021) General instance distillation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7842–7851. Cited by: §2.2.
  • J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li (2020) Voxel r-cnn: towards high performance voxel-based 3d object detection. arXiv preprint arXiv:2012.15712 1 (2), pp. 4. Cited by: §2.1, Table 1, Table 2.
  • M. Dunnhofer, N. Martinel, and C. Micheloni (2021) Weakly-supervised domain adaptation of deep regression trackers via reinforced knowledge distillation. IEEE Robotics and Automation Letters 6 (3), pp. 5016–5023. Cited by: §2.2.
  • T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhadran (2017) Efficient knowledge distillation from an ensemble of teachers.. In Interspeech, pp. 3697–3701. Cited by: §2.2.
  • A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §1, §4.1.
  • J. Guo, K. Han, Y. Wang, H. Wu, X. Chen, C. Xu, and C. Xu (2021) Distilling object detectors via decoupled features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2154–2164. Cited by: §2.2, §3.2.
  • M. Hao, Y. Liu, X. Zhang, and J. Sun (2020) LabelEnc: a new intermediate supervision method for object detection. In European Conference on Computer Vision, pp. 529–545. Cited by: §2.2.
  • G. Hinton, O. Vinyals, J. Dean, et al. (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2 (7). Cited by: §2.2.
  • J. Hu, C. Fan, H. Jiang, X. Guo, Y. Gao, X. Lu, and T. L. Lam (2021) Boosting light-weight depth estimation via knowledge distillation. arXiv preprint arXiv:2105.06143. Cited by: §2.2.
  • J. Jiao, Y. Wei, Z. Jie, H. Shi, R. W. Lau, and T. S. Huang (2019) Geometry-aware distillation for indoor semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2869–2878. Cited by: §2.2.
  • A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) Pointpillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12697–12705. Cited by: Figure 1, §1, §3.1, §3.3, §4.1, §4.2, §4.3, §4.3, Table 1, Table 2, Table 3.
  • J. Li, H. Dai, L. Shao, and Y. Ding (2021a) Anchor-free 3d single stage detector with mask-guided attention for point cloud. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 553–562. Cited by: §1, §2.1.
  • J. Li, H. Dai, L. Shao, and Y. Ding (2021b) From voxel to point: iou-guided 3d object detection for point cloud with voxel-to-point decoder. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 4622–4631. Cited by: §1, §2.1, Table 1.
  • M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7345–7353. Cited by: Table 1.
  • Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang (2019a) Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2604–2613. Cited by: §2.2.
  • Y. Liu, X. Dong, X. Lu, F. S. Khan, J. Shen, and S. Hoi (2019b) Teacher-students knowledge distillation for siamese trackers. arXiv preprint arXiv:1907.10586. Cited by: §2.2.
  • Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai (2020) Tanet: robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11677–11684. Cited by: Table 2.
  • J. Mao, M. Niu, H. Bai, X. Liang, H. Xu, and C. Xu (2021a) Pyramid r-cnn: towards better performance and adaptability for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2723–2732. Cited by: Table 2.
  • J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu (2021b) Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3164–3173. Cited by: §1, §2.1, Table 2.
  • J. Noh, S. Lee, and B. Ham (2021) Hvpr: hybrid voxel-point representation for single-stage 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14605–14614. Cited by: §2.1.
  • X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang (2021) 3d object detection with pointformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7463–7472. Cited by: §2.1.
  • S. Pang, D. Morris, and H. Radha (2020) CLOCs: camera-lidar object candidates fusion for 3d object detection. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10386–10393. Cited by: §4.2, Table 1.
  • C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 918–927. Cited by: §2.1, Table 1, Table 2.
  • H. Sheng, S. Cai, Y. Liu, B. Deng, J. Huang, X. Hua, and M. Zhao (2021) Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2743–2752. Cited by: §2.1, Table 1.
  • S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li (2020) PV-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, §4.2, Table 1, Table 2.
  • S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li (2021) PV-rcnn++: point-voxel feature set abstraction with local vector representation for 3d object detection. arXiv preprint arXiv:2102.00463. Cited by: §2.1, §4.1, Table 1, Table 2.
  • S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 770–779. Cited by: §2.1, §4.2, Table 1, Table 2, Table 3.
  • S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li (2020) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE transactions on pattern analysis and machine intelligence 43 (8), pp. 2647–2664. Cited by: Table 1.
  • W. Shi and R. Rajkumar (2020) Point-gnn: graph neural network for 3d object detection in a point cloud. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1711–1719. Cited by: §2.1.
  • C. Shu, Y. Liu, J. Gao, L. Xu, and C. Shen (2020) Channel-wise distillation for semantic segmentation. arXiv e-prints, pp. arXiv–2011. Cited by: §2.2.
  • K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR. Cited by: §3.1.
  • O. Team et al. (2020) Openpcdet: an open-source toolbox for 3d object detection from point clouds. Cited by: Table 1, Table 2.
  • S. Vora, A. H. Lang, B. Helou, and O. Beijbom (2020) Pointpainting: sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4604–4612. Cited by: §1, §3.1, Table 1.
  • C. Wang, C. Ma, M. Zhu, and X. Yang (2021) Pointaugmenting: cross-modal augmentation for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11794–11803. Cited by: §1.
  • T. Wang, L. Yuan, X. Zhang, and J. Feng (2019) Distilling object detectors with fine-grained feature imitation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4933–4942. Cited by: §3.2.
  • Y. Wang, A. Fathi, J. Wu, T. Funkhouser, and J. Solomon (2020) Multi-frame to single-frame: knowledge distillation for 3d object detection. arXiv preprint arXiv:2009.11859. Cited by: §2.2.
  • Z. Wang and K. Jia (2019) Frustum convnet: sliding frustums to aggregate local point-wise features for amodal 3d object detection. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1742–1749. Cited by: Table 1.
  • D. Xu, D. Anguelov, and A. Jain (2018) Pointfusion: deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 244–253. Cited by: §1.
  • Q. Xu, Y. Zhong, and U. Neumann (2021) Behind the curtain: learning occluded shapes for 3d object detection. arXiv preprint arXiv:2112.02205. Cited by: §2.1.
  • Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337. Cited by: §4.1, Table 1, Table 2, Table 3.
  • Z. Yang, Y. Sun, S. Liu, and J. Jia (2020) 3dssd: point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11040–11048. Cited by: Table 1, Table 2.
  • Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) Std: sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1951–1960. Cited by: §2.1, Table 1.
  • W. Yi, W. Zibu, R. Yongming, L. Jiaxin, Z. Jie, and L. Jiwen (2022) LiDAR distillation: bridging the beam-induced domain gap for 3d object detection. arXiv preprint arXiv:2203.14956. Cited by: §2.2.
  • T. Yin, X. Zhou, and P. Krahenbuhl (2021) Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11784–11793. Cited by: §2.1, §4.1, Table 1, Table 2.
  • T. Yin, X. Zhou, and P. Krähenbühl (2021) Multimodal virtual point 3d detection. Advances in Neural Information Processing Systems 34. Cited by: §1.
  • J. H. Yoo, Y. Kim, J. Kim, and J. W. Choi (2020) 3d-cvf: generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In European Conference on Computer Vision, pp. 720–736. Cited by: Table 1, Table 2.
  • Y. Yuan, X. Chen, and J. Wang (2020) Object-contextual representations for semantic segmentation. Cited by: §3.1.
  • Y. Zhang, Q. Hu, G. Xu, Y. Ma, J. Wan, and Y. Guo (2022) Not all points are equal: learning highly efficient point-based detectors for 3d lidar point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 1, Table 2.
  • Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4490–4499. Cited by: §2.1, §4.2, Table 1, Table 3.
  • Z. Zhou, L. Du, X. Ye, Z. Zou, X. Tan, E. Ding, L. Zhang, X. Xue, and J. Feng (2021) SGM3D: stereo guided monocular 3d object detection. arXiv preprint arXiv:2112.01914. Cited by: §3.2.
  • X. Zhu, Y. Ma, T. Wang, Y. Xu, J. Shi, and D. Lin (2020) Ssn: shape signature networks for multi-class object detection from point clouds. In European Conference on Computer Vision, pp. 581–597. Cited by: §2.1.