3D object detection is becoming an active research topic in both computer vision and computer graphics. Compared to 2D object detection in RGB images, predicting 3D bounding boxes in real-world environments captured as point clouds is more essential for many tasks [song2016deep], such as indoor robot navigation [mccormac2018fusion++] and robot grasping [wang2019densefusion]. However, the unstructured nature of point clouds makes detection more challenging than in 2D. In particular, the popular convolutional neural networks (CNNs), which are highly successful in 2D object detection, are difficult to apply to point clouds directly.
This challenge has attracted growing interest. With the emergence of deep networks for 3D point processing, such as [qi2017pointnet, qi2017pointnet++], several deep learning based methods have recently been proposed to detect objects directly from 3D point clouds [hou20193d, qi2019deep]. The most recent work, VoteNet [qi2019deep], proposed an end-to-end 3D object detection network on the basis of Hough voting. VoteNet recasts the traditional Hough voting procedure as a regression problem implemented by a deep network, and samples a number of seed points from the input point cloud to generate patches that vote for potential object centers. The voted centers are then used to estimate the 3D bounding boxes. The voting strategy enables VoteNet to significantly reduce the search space and achieve state-of-the-art results on several benchmark datasets. However, because it treats every point patch and object individually, VoteNet does not consider the relationships between different objects or between objects and the scene they belong to, which limits its detection accuracy.
An example can be seen in Fig. 1. Point clouds, captured by e.g. depth cameras, often contain noisy and missing data. This together with indoor occlusions makes it difficult even for humans to recognize what and where an object is in Fig. 1(a). Nevertheless, considering the surrounding contextual information in Figs. 1(b-d), it is much easier to recognize it is a chair given the surrounding chairs and the table in the dining room scene. Actually, the representation of a scanned point set could be ambiguous when it is presented individually, due to lack of color appearance and data missing problems. Therefore, we argue that indoor depth scans are often so occluded that contexts could even play a more important role in recognizing objects than the point data itself. This contextual information has been demonstrated to be helpful in a variety of computer vision tasks, including object detection [hu2018relation, yu2016role], image semantic segmentation [zhang2019co, fu2019dual] and 3D scene understanding [zhang2014panocontext, zhang2017deepcontext]. In this paper, we show how to leverage the contextual information in 3D scenes to boost the performance of 3D object detection from point clouds.
In our view, contextual information for 3D object detection consists of multiple levels. At the lowest is the patch level where the data missing problem is mitigated with a weighted sum over similar point patches to assist more accurate voting of object centers. At the object level, coexistence of objects provides strong hints on detection of certain objects. For example, as shown in Fig. 1(d), the detected table can give a tendency for chairs to be detected at surrounding points. At the scene level, global scene clues can also prevent an object from being detected in an improper scene. For example, we will not expect to detect a bed in a kitchen. The contexts at different levels complement each other and are utilized together to assist the correct inference of objects in noisy and cluttered environments.
We thus propose a novel 3D object detection framework, called Multi-Level Context VoteNet (MLCVNet), to incorporate into VoteNet multi-level contextual information for 3D object detection. Specifically, we propose a unified network to model the multi-level contexts, from local point patches to global scenes. The difference between VoteNet and the proposed network is highlighted in Fig. 2. To model the contextual information, three sub-modules are proposed in the framework, i.e., patch-to-patch context (PPC) module, object-to-object context (OOC) module and the global scene context (GSC) module. In particular, similar to [zhang2019pcan], we use the self-attention mechanism to model the relationships between elements in both PPC and OOC modules. These two sub-modules aim at adaptively encoding contextual information at the patch and object levels, respectively. For the scene-level, we design a new branch as shown in Fig. 2(c) to fuse multi-scale features to equip the network with the ability of learning global scene context. In summary, the contributions of this paper include:
We propose the first 3D object detection network that exploits multi-level contextual information at patch, object and global scene levels.
We design three sub-modules, including two self-attention modules and a multi-scale feature fusion module, to capture the contextual information at multiple levels in 3D object detection. The new modules nicely fit in the state-of-the-art VoteNet framework. Ablation study demonstrates the effectiveness of these modules in improving detection accuracy.
Extensive experiments demonstrate the benefits of multi-level contextual information. The proposed network outperforms state-of-the-art methods on both SUN RGB-D and ScanNetV2 datasets.
2 Related Work
2.1 3D Object Detection From Point Clouds
Object detection from 2D images has been studied for decades. Since the development of deep convolutional neural networks (DCNNs) [krizhevsky2012imagenet], both the accuracy and efficiency of 2D object detection have been significantly improved by deep learning techniques [girshick2015fast, ren2015faster]. In contrast, 3D object detection was dominated by non-deep-learning methods [nan2012search, li2015database, wang2016cluttered] until the last few years. With the development of deep learning on 3D point clouds [wang2017cnn, li2018pointcnn, atzmon2018point], many deep learning based 3D object detection architectures have emerged [chen2016monocular, chen2017multi, lahoud20172d]. However, most of these methods depend on 2D detectors as an intermediate step, which restricts their generalization to situations where 2D detectors do not work well [qi2018frustum]. To address this issue, several deep learning based 3D detectors that directly take raw point clouds as input have been proposed recently [zhou2018voxelnet, yang2019learning, hou20193d]. In [shi2019pointrcnn], the authors introduced a two-stage 3D object detector, PointRCNN. Their method first generates several 3D bounding box proposals, and then refines these proposals to obtain the final detection results. Instead of directly treating 3D object proposal generation as a bounding box regression problem, [yi2019gspn] proposed a novel 3D object proposal approach that takes an analysis-by-synthesis strategy and reconstructs 3D shapes from point clouds. Inspired by the Hough voting strategy for 2D object detection in [leibe2004combined], the work in [qi2019deep] presents an end-to-end trainable 3D object detection network that operates directly on 3D point clouds, building on the success of PointNet/PointNet++ [qi2017pointnet, qi2017pointnet++].
Although a lot of methods have been proposed recently, there is still large room for improvement especially for real-world challenging cases. Previous works largely ignored contextual information, i.e., relationships within and between objects and scenes. In this work, we show how to leverage the contextual information to improve the accuracy of 3D object detection.
2.2 Contextual Information
The work in [mottaghi2014role] has demonstrated that contextual information has a significant positive effect on 2D semantic segmentation and object detection. Since then, contextual information has been successfully employed to improve performance on many tasks such as 2D object detection [yu2016role, hu2018relation, liu2018structure], 3D point matching [deng2018ppfnet], point cloud semantic segmentation [engelmann2017exploring, ye20183d], and 3D scene understanding [zhang2014panocontext, zhang2017deepcontext]. The work in [hu2018semantic] achieves reasonable results on instance segmentation of 3D point clouds by analyzing point patch context. In [shi2019hierarchy], a recursive auto-encoder based approach is proposed to perform 3D object detection by exploring hierarchical context priors in 3D object layout. Inspired by the self-attention idea in natural language processing [vaswani2017attention], recent works connect the self-attention mechanism with contextual information mining to improve scene understanding tasks such as image recognition [hu2018squeeze], semantic segmentation [fu2019dual] and point cloud recognition [xie2018attentional]. As for 3D point data processing, the work in [zhang2019pcan] proposes to utilize an attention network to capture the contextual information in 3D points. Specifically, it presents a point contextual attention network to encode local features into a global descriptor for point cloud based retrieval. In [paigwar2019attentional], an attentional PointNet is proposed to search regions of interest instead of processing the whole input point cloud when detecting 3D objects in large-scale point clouds. Different from previous works, we are interested in exploiting the combination of multi-level contextual information for 3D object detection from point clouds. In particular, we integrate two self-attention modules and one multi-scale feature fusion module into a deep Hough voting network to learn multi-level contextual relationships between patches, objects and the global scene.
As shown in Fig. 3, our MLCVNet contains four main components: a fundamental 3D object detection framework based on VoteNet which follows the architecture in [qi2019deep], and three context encoding modules. The PPC (patch-patch context) module combines the point groups to encode the patch correlation information, which helps to vote for more accurate object centers. The OOC (object-object context) module is for capturing the contextual information between object candidates. This module helps to improve the results of 3D bounding box regression and classification. The GSC (global scene context) module is to integrate the global scene contextual information. In brief, the proposed three sub-modules are designed to capture complementary contextual information in 3D object detection at multiple levels, with the aim to improve the detection performance in 3D point clouds.
VoteNet [qi2019deep] is the baseline of our work. As illustrated in Fig. 2, it is an end-to-end trainable 3D object detection network consisting of three main blocks: point feature extraction, voting, and object proposal and classification.
To extract point features, PointNet++ is used as the backbone network for seed sampling and for extracting high-dimensional features for the seed points from the raw input point cloud. The features of each seed point contain information from its surrounding points within a radius, as illustrated in Fig. 4(a). By analogy with regional patches in 2D, we thus call these seed points point patches in the remainder of this paper. The voting block takes the point patches with extracted features as input and regresses object centers. This center point prediction is performed by a multi-layer perceptron (MLP) that simulates the Hough voting procedure. Clusters are then generated by grouping the predicted centers to form object candidates, from which the 3D bounding boxes are proposed and classified through another MLP layer.
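The voting and grouping steps can be illustrated with a minimal sketch. It is not VoteNet's implementation: the learned voting MLP is replaced by a hypothetical `predict_offset` callable, cluster centers are passed in rather than sampled from the votes, and grouping uses a plain radius test. All function and parameter names are assumptions made for illustration.

```python
import math

def cast_votes(seeds, predict_offset):
    """Each seed (x, y, z) votes for an object center: vote = seed + offset."""
    votes = []
    for seed in seeds:
        dx, dy, dz = predict_offset(seed)  # stands in for the learned voting MLP
        votes.append((seed[0] + dx, seed[1] + dy, seed[2] + dz))
    return votes

def group_votes(votes, centers, radius):
    """Collect, for every cluster center, the votes lying within `radius` of it."""
    clusters = [[] for _ in centers]
    for vote in votes:
        for k, center in enumerate(centers):
            if math.dist(vote, center) <= radius:
                clusters[k].append(vote)
    return clusters

# Toy example: every seed predicts an offset pointing back to the origin.
seeds = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, -1.0)]
votes = cast_votes(seeds, lambda s: (-s[0], -s[1], -s[2]))
clusters = group_votes(votes, centers=[(0.0, 0.0, 0.0)], radius=0.3)
```

In this toy case all three votes coincide at the origin and fall into the single cluster, which would then be passed on to the proposal and classification stage.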
Note that in VoteNet, both the point patches and the object candidates are processed independently, ignoring the surrounding patches or objects. However, we argue that relationships between these elements (i.e., point patches and object candidates) are useful information for object detection. Thus, we introduce our MLCVNet to encode these relationships. Our detection network follows the general framework of VoteNet, but integrates three new sub-modules to capture multi-level contextual information.
3.2 PPC Module
We consider relationships between point patches as the first level of context, i.e., patch-patch context (PPC), as shown in Fig. 4(a). At this level, contextual information between point patches, on the one hand, helps relieve the data missing problem via gathering supplementary information from similar patches. On the other hand, it considers inter-relationships between patches for voting [wang2013learning] by aggregating voting information from both the current point patch and all the other patches. We thus propose a sub-network, PPC module, to capture the relationships between point patches. For each point patch, the basic idea is to employ a self-attention module to aggregate information from all the other patches before sending it to the voting stage.
As shown in Fig. 4(a), after feature extraction using PointNet++, we get a feature map $A \in \mathbb{R}^{M \times D}$, where $M$ is the number of point patches sampled from the raw point cloud, and $D$ is the dimension of the feature vector. We intend to generate a new feature map $A'$ that encodes the correlation between any two point patches, and it can be formulated as the non-local operation:
$$A' = f\big(\theta(A), \phi(A)\big)\, g(A),$$
where $\theta$, $\phi$ and $g$ are three different transform functions, and $f$ encodes the similarities between any two positions of the input feature. Moreover, as shown in [hu2018squeeze], channel correlations in the feature map also contribute to contextual information modeling in object detection tasks. We thus make use of the compact generalized non-local network (CGNL) [yue2018compact] as the attention module to explicitly model rich correlations between any pair of point patches and of any channels in the feature space. CGNL requires light computation and few additional parameters, making it more practically applicable. After the attention module, each row in the new feature map still corresponds to a point patch, but contains not only its own local features but also the information associated with all the other point patches.
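To make the non-local operation concrete, here is a minimal sketch of self-attention over patch features. It makes several simplifying assumptions that are not taken from the paper: the three learned transform functions are replaced by the identity, the pairwise similarity is a softmax over dot products, and the channel-wise correlations that CGNL additionally models are omitted. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def non_local_attention(features):
    """features: N patch vectors of length D. Returns N context-enhanced vectors."""
    enhanced = []
    for query in features:
        # Similarity between this patch and every patch (dot product).
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted sum over all patch features, one output channel at a time.
        enhanced.append([
            sum(w * value[d] for w, value in zip(weights, features))
            for d in range(len(query))
        ])
    return enhanced

# Toy example: patches 0 and 2 are identical, patch 1 differs.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
enhanced = non_local_attention(patches)
```

Each output row is a convex combination of all input rows, so similar patches reinforce each other, which is the effect the weighted fusion relies on when some patches suffer from missing data.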
The effectiveness of the PPC module is visualized in Fig. 4(b). As shown, with the PPC module, the voted centers are more meaningful with more of them appearing on objects rather than on non-object regions. Moreover, the voted centers are more closely clustered compared to those without the module. The results demonstrate that our self-attention based weighted fusion over local point patches can enhance the performance of voting for object centers.
3.3 OOC Module
Most existing object detection frameworks detect each object individually. VoteNet is no exception, where each cluster is independently fed into the MLP layer to regress its object class and bounding box. However, combining features from other objects gives more information on the object relationships, which has been demonstrated to be helpful in image object detection [chen2018context]. Intuitively, objects will get weighted messages from those highly correlated objects. In such a way, the final predicted object result is not only determined by its own individual feature vector but also affected by object relationships. We thus regard the relationships between objects as the second level contextual information, i.e., object-object context (OOC).
We get a set of vote clusters $\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$ after grouping the voted centers, where $K$ is the number of generated clusters in this work. Each cluster $C = \{v_1, v_2, \ldots, v_n\}$ is fed into an MLP followed by max pooling to form a single vector representing the cluster. Here $v_i$ represents the $i$-th vote in $C$, and $n$ is the number of votes in $C$. This is where our method departs from VoteNet. Instead of processing each cluster vector independently to generate a proposal and classification, we consider the relationships between objects. Specifically, we introduce a self-attention module before the proposal and classification step, as shown in Fig. 3 (the blue module). Fig. 5(a) shows the details inside the OOC module. After max pooling, the cluster vectors are fed into the CGNL attention module to generate a new feature map that records the affinity between all clusters. The encoding of object relationships can be summarized as:
$$C_{OOC} = \mathrm{Attention}\Big(\max_{i=1,\ldots,n} \{\mathrm{MLP}(v_i)\}\Big),$$
where $C_{OOC}$ is the enhanced feature vector in the new feature map, and $\mathrm{Attention}(\cdot)$ is the CGNL attention mapping. By doing so, the contextual relationships between these clusters (objects) are encoded into the new feature map.
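The object-level encoding can be sketched as two steps: pool each cluster into one vector, then let an attention mapping see all cluster vectors jointly. In the sketch below, `mlp` and `attention` are hypothetical callables standing in for the learned MLP and the CGNL attention module; the toy example uses a doubling function and the identity, purely so that the data flow is easy to follow.

```python
def pool_cluster(vote_features, mlp):
    """Apply `mlp` to each vote feature, then take the element-wise maximum."""
    transformed = [mlp(v) for v in vote_features]
    return [max(column) for column in zip(*transformed)]

def encode_objects(clusters, mlp, attention):
    """clusters: K lists of vote feature vectors. Returns K enhanced vectors."""
    cluster_vectors = [pool_cluster(c, mlp) for c in clusters]
    # The attention mapping receives all K cluster vectors at once, so each
    # output row can depend on every other object candidate.
    return attention(cluster_vectors)

# Toy example with placeholder transforms (assumptions, not learned modules).
clusters = [[[1.0, 5.0], [3.0, 2.0]], [[0.0, 1.0]]]
double = lambda v: [2.0 * x for x in v]
identity = lambda rows: rows
encoded = encode_objects(clusters, double, identity)
```

Max pooling makes each cluster vector independent of the number and order of its votes, which is why clusters of different sizes can be stacked into a single feature map before attention.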
The effectiveness of the OOC module is visualized in Fig. 5(b). As shown, with the OOC module, there are fewer detected objects overlapping with each other, and the positions of the detected objects are more accurate.
3.4 GSC Module
The whole point cloud usually contains rich scene contextual information which can help enhance the object detection accuracy. For example, it would be highly possible that a chair rather than a toilet is identified when the whole scene is a dining room rather than a bathroom. Therefore, we regard the information about the whole scene as the third level context, i.e., global scene context (GSC). Inspired by the idea of scene context extraction in [liu2018structure], we propose the GSC module (the green module in Fig. 3) to leverage the global scene context information to improve feature representation for 3D bounding box proposal and object classification, without explicit supervision of scenes.
The GSC module is designed to capture the global scene contextual information by introducing a global scene feature extraction branch. Specifically, we create a new branch that takes input from the patch and object levels, concatenating the features at the layers before self-attention is applied in PPC and OOC. As shown in Fig. 6(a), at the two layers each row represents a point patch $p_i$ ($i = 1, \ldots, M$) or an object candidate $c_j$ ($j = 1, \ldots, K$), where $M$ and $K$ are the numbers of the sampled point patches and clusters, respectively. Max-pooling is first applied to get two vectors (i.e., the patch vector and the cluster vector), combining information from all the point patches and object candidates. Following the idea of multi-scale feature fusion in the contextual modeling strategy of 2D detectors, these two vectors are then concatenated to form a global feature vector. An MLP layer is applied to further aggregate global information, and the output is subsequently expanded and combined with the output feature map of the OOC module. This multi-scale feature fusion procedure can be summarized as:
$$C_{new} = \mathrm{MLP}\Big(\big[\max_{j} c_j;\ \max_{i} p_i\big]\Big) + C_{OOC},$$
where $[\cdot\,;\cdot]$ denotes concatenation, $C_{OOC}$ is the output feature map of the OOC module, and the MLP output is expanded to match its $K$ rows.
In this way, the inference of the final 3D bounding boxes and the object classes will consider the compatibility with the scene context, which makes the final prediction more reliable under the effect of global cues. As shown in Fig. 6(b), the GSC module effectively reduces false detection in the scene.
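The fusion procedure of the GSC branch can be sketched in a few lines. This is an illustration under stated assumptions: `mlp` is a hypothetical callable replacing the learned layer, "expanded" is read as broadcasting the scene vector to every object row, and "combined" is read as element-wise addition.

```python
def column_max(rows):
    """Max-pool a feature map (list of equal-length rows) into one vector."""
    return [max(column) for column in zip(*rows)]

def global_scene_context(patch_feats, cluster_feats, ooc_feats, mlp):
    """Fuse a global scene vector into every object feature.

    patch_feats:   M x D features taken before patch-level attention
    cluster_feats: K x D features taken before object-level attention
    ooc_feats:     K x D' output of the object-level attention module
    mlp:           maps the concatenated 2D-length vector to length D'
    """
    # Concatenate the pooled cluster vector and the pooled patch vector.
    global_vector = column_max(cluster_feats) + column_max(patch_feats)
    scene = mlp(global_vector)
    # Broadcast the scene vector to all K rows and add it element-wise.
    return [[o + s for o, s in zip(row, scene)] for row in ooc_feats]

# Toy example; the lambda is a placeholder for a learned MLP.
patch_feats = [[1.0, 2.0], [3.0, 0.0]]
cluster_feats = [[0.0, 5.0], [4.0, 1.0]]
ooc_feats = [[1.0, 1.0], [2.0, 2.0]]
fused = global_scene_context(patch_feats, cluster_feats, ooc_feats,
                             lambda v: [sum(v), 0.0])
```

Because the same scene vector is added to every object candidate, it acts as a shared bias that can shift all proposals toward classes compatible with the scene.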
4 Results and Discussions
| Method | Input | mAP@0.25 | mAP@0.5 |
|---|---|---|---|
| MRCNN 2D-3D [he2017mask] | Geo+RGB | 17.3 | 10.5 |
| 3D-SIS [hou20193d] | Geo only | 25.4 | 14.6 |
| VoteNet [qi2019deep] | Geo only | 58.6 | 33.5 |
We evaluate our approach on SUN RGB-D [song2015sun] and ScanNet [dai2017scannet] datasets. SUN RGB-D is a well-known public RGB-D image dataset of indoor scenes, consisting of 10,335 frames with 3D object bounding box annotations. Over 64,000 3D bounding boxes are given in the entire dataset. As described in [zhang2017deepcontext], these scenes were mostly taken from household environments with strong context. The occlusion problem is quite severe in SUN RGB-D dataset. Sometimes, it is even difficult for humans to recognize the objects in the scene when merely a 3D point cloud is given without any color information. Thus, it is a challenging dataset for 3D object detection.
The ScanNet dataset contains 1,513 scanned 3D indoor scenes with densely annotated meshes. The ground-truth 3D bounding boxes of objects are also provided. The completeness of scenes in ScanNet makes it an ideal dataset for training our network to learn the contextual information at multiple levels.
4.2 Training details
Our network is trained end-to-end using an Adam optimizer and batch size 8. The base learning rate is set to for ScanNet dataset and for SUN RGB-D dataset. The network is trained for epochs on both datasets. The learning rate decay steps are set to for ScanNet, for SUN RGB-D, and the decay rates are . Training the model until convergence on one RTX 2080 Ti GPU takes around 4 hours on ScanNetV2 and 11 hours on SUN RGB-D. During training, we found that the mAP fluctuates within a small range. Thus, the mAP results reported in the paper are the mean over three runs.
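A step-decay schedule of the kind described here, with a list of decay steps and matching decay rates, can be written as a small function. The sketch below is generic; the numeric values in the example are hypothetical and chosen only for illustration, not the settings used in the paper.

```python
def learning_rate(epoch, base_lr, decay_steps, decay_rates):
    """Step decay: multiply by decay_rates[i] once `epoch` reaches decay_steps[i]."""
    lr = base_lr
    for step, rate in zip(decay_steps, decay_rates):
        if epoch >= step:
            lr *= rate
    return lr

# Hypothetical values for illustration only; not the paper's settings.
schedule = [learning_rate(e, 0.1, [2, 4], [0.5, 0.5]) for e in range(6)]
```

With these placeholder values the rate is halved at epoch 2 and again at epoch 4.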
For parameter size, we check the file sizes of the stored PyTorch models for both our method and VoteNet. The model size of our network is, while VoteNet is . For training time, VoteNet takes around 40s for 1 epoch with a batch size of 8, while ours takes around 42s. For inference time, we run detection on 1 batch and measure the time. VoteNet takes around 0.13s, while ours takes 0.14s. The times reported here are all measured on the ScanNet dataset. These results show that our method only slightly increases the complexity.
4.3 Comparisons with the State-of-the-art Methods
We first evaluate our method on SUN RGB-D dataset using the same 10 most common object categories as in [qi2019deep]. Table 1 gives a quantitative comparison of our method with deep sliding shapes (DSS) [song2016deep], cloud of gradients (COG) [ren2016three], 2D-driven [lahoud20172d], F-PointNet [qi2018frustum] and VoteNet [qi2019deep].
Remarkably, our method achieves better overall performance than all the other methods on SUN RGB-D dataset. The overall mAP (mean average precision) of MLCVNet reaches on SUN RGB-D validation set, higher than the current state-of-the-art, VoteNet. The heavy occlusion presented in SUN RGB-D dataset is a challenge for methods (e.g., VoteNet) that consider point patches individually. However, the utilization of contextual information in MLCVNet helps with the detection of occluded objects with missing parts, which we believe is the reason for the improved detection accuracy.
We also evaluate our MLCVNet against several more competing approaches, MRCNN 2D-3D [he2017mask], GSPN [yi2019gspn] and 3D-SIS [hou20193d], on ScanNet benchmark in Table 2. We report the detection results on both mAP and mAP. The mAP of MLCVNet on ScanNet validation set reaches making absolute points improvement over the best competitor VoteNet, and the mAP is even higher, making points improvement. The significant improvements confirm the effectiveness of our integration of multi-level contextual information. Table 3 shows the detailed results at mAP for each object category in ScanNetV2 dataset. As can be seen, for some specific categories, such as shower curtain and window, the improvements exceed 8 points. It is found that plane-like objects, such as door, window, picture and shower curtain, usually get higher improvements. The reason could be that these objects contain more similar point patches, which can be used by the attention module to complement each other to a great extent.
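Detection mAP is conventionally computed by counting a predicted box as correct when its 3D intersection-over-union (IoU) with a ground-truth box of the same class exceeds a threshold. The sketch below computes IoU for axis-aligned boxes only. The box format `(cx, cy, cz, dx, dy, dz)` is an assumption made for illustration, and oriented boxes are not handled.

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, dx, dy, dz),
    i.e. center coordinates followed by the full extent along each axis."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis] - box_a[axis + 3] / 2,
                  box_b[axis] - box_b[axis + 3] / 2)
        high = min(box_a[axis] + box_a[axis + 3] / 2,
                   box_b[axis] + box_b[axis + 3] / 2)
        if high <= low:
            return 0.0  # no overlap along this axis, so no overlap at all
        intersection *= high - low
    volume_a = box_a[3] * box_a[4] * box_a[5]
    volume_b = box_b[3] * box_b[4] * box_b[5]
    return intersection / (volume_a + volume_b - intersection)

# Two 2x2x2 boxes shifted by 1 along x share a 1x2x2 region.
overlap = iou_3d((0.0, 0.0, 0.0, 2.0, 2.0, 2.0), (1.0, 0.0, 0.0, 2.0, 2.0, 2.0))
```

In this example the intersection volume is 4 and the union is 12, so the IoU is one third: the prediction would count as a match at a threshold of 0.25 but not at 0.5.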
4.4 Ablation Study
To quantitatively evaluate the effectiveness of the proposed contextual sub-modules, we conduct experiments with different combinations of these modules. The quantitative results are shown in Table 4. The baseline method is the VoteNet. We then add the proposed sub-modules one by one into the baseline model. Applying the PPC module leads to improvements in mAP of and . The combination of PPC and OOC modules further improves the evaluation scores to and respectively. As expected, when equipped with all the three sub-modules, the mAP of our MLCVNet is boosted up to the highest scores on both datasets. It can be seen that contextual information captured by the designed sub-modules indeed brings notable improvements over the state-of-the-art method.
4.5 Qualitative Results
Fig. 7 shows a qualitative comparison of the results of MLCVNet and VoteNet for 3D bounding box prediction on the ScanNetV2 validation set. The proposed MLCVNet detects more reasonable objects (red arrows) and predicts more precise boxes (blue arrows). The pink box produced by VoteNet is classified as a window, yet it overlaps with a door, which is implausible; our method ensures compatibility between objects and scenes. The qualitative comparison results on SUN RGB-D are shown in Fig. 8. Our model is still able to produce high-quality boxes even though the scenes are heavily occluded and less informative. In the bedroom example in Fig. 8, VoteNet produces overlapping and missing detections (red arrows), while our model successfully detects all the objects with good precision compared to the ground truth. For the second scene in Fig. 8, VoteNet misclassifies the table, produces overlaps, and predicts inaccurate boxes (red arrows), while our model produces much cleaner and more accurate results. However, it is worth noting that our method may still fail in some predictions, such as the overlapped windows in the red square in Fig. 7. Therefore, there is still room for improvement in 3D bounding box prediction when dealing with complicated scenes.
In this paper, we propose a novel network that integrates contextual information at multiple levels into 3D object detection. We make use of self-attention mechanism and multi-scale feature fusion to model the multi-level contextual information, and propose three sub-modules. The PPC module encodes the relationships between point patches, the OOC module captures the contextual information of object candidates, and the GSC module aggregates the global scene context. Ablation studies demonstrate the effectiveness of the proposed contextual sub-modules to improve the detection accuracy. Quantitative and qualitative experiments further demonstrate that our architecture successfully improves the performance of 3D object detection.
Future work. Contextual information analysis in 3D object detection still offers huge space for exploration. For example, to enhance the global scene context constraint, one possible way is to use the global feature in the GSC module to predict scene types as an auxiliary learning task, which can explicitly supervise the global feature representation. Another direction would be a more effective mechanism to encode the contextual information as in [hu2018relation].
This work was supported in part by National Natural Science Foundation of China under Grant (61772267, 61572507, 61532003, 61622212), the Fundamental Research Funds for the Central Universities under Grant NE2016004, the National Key Research and Development Program of China (No. 2018AAA0102200) and the Natural Science Foundation of Jiangsu Province under Grant BK20190016.