1 Introduction
(*: equal contributions. †: work done while at Facebook.)
Recognition and localization of objects in a 3D environment is an important first step towards full scene understanding. Even such a low-dimensional scene representation can serve applications like autonomous navigation and augmented reality. Recently, with advances in deep networks for point cloud data, several works [voteNet, zhou2018voxelnet, shi2018pointrcnn] have shown state-of-the-art 3D detection results with point clouds as the only input. Among them, the recently proposed VoteNet [voteNet] by Qi et al., which takes only 3D geometry as input, showed remarkable improvement for indoor object recognition compared with previous works that exploit all RGB-D channels. This leads to an interesting research question: Is 3D geometry data (point clouds) sufficient for 3D detection, or is there any way RGB images can further boost current detectors?
By examining the properties of point cloud data and RGB image data (see for example Fig. 1), we believe the answer is clear: RGB images have value in 3D object detection. In fact, images and point clouds provide complementary information. RGB images have higher resolution than depth images or LiDAR point clouds and contain rich textures that are not available in the point domain. Additionally, images can cover the “blind regions” of active depth sensors, which often occur due to reflective surfaces. On the other hand, images are limited in the 3D detection task as they lack absolute measures of object depth and scale, which are exactly what 3D point clouds can provide. These observations strengthen our intuition that images can help point cloud-based 3D detection.
However, how to make effective use of 2D images in a 3D detection pipeline is still an open problem. A naïve way is to directly append raw RGB values to the point cloud, since the point-pixel correspondence can be established through projection. But since 3D points are much sparser, doing so loses the dense patterns from the image domain. In light of this, more advanced ways to fuse 2D and 3D data have been proposed recently. One line of work [qi2018frustum, xu2018pointfusion, lahoud20172d] uses mature 2D detectors to provide initial proposals in the form of frustums, which limits the 3D search space for estimating 3D bounding boxes. However, due to its cascaded design, it does not leverage 3D point clouds in the initial detection stage. In particular, if an object is missed in 2D, it will be missed in 3D as well. Another line of work [song2016deep, ku2018joint, wang2019densefusion, hou20193d] takes a more 3D-focused approach, concatenating intermediate ConvNet features from 2D images to 3D voxels or points to enrich 3D features before they are used for object proposal and box regression. The downside of such systems is that they do not use 2D images directly for localization, which can provide helpful guidance for detecting objects in 3D.
In our work, we build upon the successful VoteNet architecture [voteNet] and design a joint 2D-3D voting scheme for 3D object detection named ImVoteNet. It takes advantage of the more mature 2D detectors [ren2015faster] but at the same time retains the ability to propose objects from the full point cloud itself – combining the best of both lines of work while avoiding the drawbacks of each. A key motivation for our design is to leverage both geometric and semantic/texture cues in 2D images (Fig. 1). The geometric cues come from accurate 2D bounding boxes in images, such as those output by a 2D detector. Instead of solely relying on the 2D detection for object proposal [qi2018frustum], we defer the proposal process to 3D. Given a 2D box, we generate 2D votes in the image space, where each vote connects an object pixel to the 2D amodal box center. To pass the 2D votes to 3D, we lift
them by applying geometric transformations based on the camera intrinsic and pixel depth, so as to generate “pseudo” 3D votes. These pseudo 3D votes become extra features appended to seed points in 3D for object proposals. Besides geometric cues from the 2D votes, each pixel also passes semantic and texture cues to the 3D points, as either features extracted per-region, or ones extracted per-pixel.
After lifting and passing all the features from the images to 3D, we concatenate them with the 3D point features from a point cloud backbone network [qi2017pointnet, qi2017pointnetplusplus]. Next, following the VoteNet pipeline, those points with the fused 2D and 3D features generate 3D Hough votes [hough1959machine] – not limited by 2D boxes – toward object centers and aggregate the votes to produce the final object detections in 3D. As the seed features have both 2D and 3D information, they are intuitively more informative for recovering heavily truncated objects or objects with few points, as well as more confident in distinguishing geometrically similar objects.
In addition, we recognize that when fusing 2D and 3D sources, one has to carefully balance the information from the two modalities to avoid one being dominated by the other. To this end, we further introduce a multi-towered network structure with gradient blending [wang2019makes] to ensure our network makes the best use of both the 2D and 3D features. During testing, only the main tower that operates on the joint 2D-3D features is used, minimizing the sacrifice in efficiency.
We evaluate ImVoteNet on the challenging SUN RGB-D dataset [song2015sun]. Our model achieves state-of-the-art results while showing a significant improvement (+5.7 mAP) over the 3D-geometry-only VoteNet, validating the usefulness of image votes and 2D features. We also provide extensive ablation studies to demonstrate the importance of each individual component. Finally, we explore the potential of using color to compensate for sparsity in depth points, especially for the case of lower quality depth sensors or for cases where depth is estimated from a moving monocular camera (SLAM), showing the potential of our method for broader use cases.
To summarize, the contributions of our work are:
A geometrically principled way to fuse 2D object detection cues into a point cloud based 3D detection pipeline.
The designed deep network ImVoteNet achieves state-of-the-art 3D object detection performance on SUN RGB-D.
Extensive analysis and visualization to understand various design choices of the system.
2 Related Work
Advances in 3D sensing devices have led to a surge of methods designed to identify and localize objects in a 3D scene. The most relevant lines of work are detection with point clouds and detection with full RGB-D data. We also briefly discuss a few additional relevant works in the area of multi-modal data fusion.
3D object detection with point clouds.
To locate objects using purely geometric information, one popular line of methods is based on template matching using a collection of clean CAD models either directly [li2015database, Indoor2012, litany2015asist], or through extracted features [song2014sliding, avetisyan2019scan2cad]. More recent methods are based on point cloud deep nets [qi2017pointnet, zhou2018voxelnet, lang2019pointpillars, shi2018pointrcnn, voteNet]. In the context of 3D scene understanding, there have also been promising results on semantic and instance segmentation [yi2018gspn, choy20194d, graham20183d]. Most relevant to our work are PointRCNN [shi2018pointrcnn] and Deep Hough Voting (VoteNet) [voteNet] which demonstrated state-of-the-art 3D object detection in outdoor and indoor scenes, respectively. Notably, these results are achieved without using the RGB input. To leverage this additional information, we propose a way to further boost detection performance in this work.
3D object detection with RGB-D data.
Depth and color channels both contain information that can be useful for 3D object detection. Prior methods for fusing those two modalities broadly fall into three categories: 2D-driven, 3D-driven, and feature concatenation. The first type of method [lahoud20172d, qi2018frustum, deng2017amodal, xu2018pointfusion] starts with object detections in the 2D image, which are then used to guide the search space in 3D. By 3D-driven, we refer to methods that first generate region proposals in 3D and then utilize 2D features to make a prediction, such as Deep Sliding Shapes [song2016deep]. Recently more works focus on fusing 2D and 3D features earlier in the process, such as Multi-modal Voxelnet [wang2019densefusion], AVOD [ku2018joint], multi-sensor [liang2018deep] and 3D-SIS [hou20193d]. However, all of these perform fusion mostly through concatenation of 2D features to 3D features. Our proposed method is most closely related to this third type, but differs from it in two important aspects. First, we propose to make explicit use of geometric cues from the 2D detector and lift them to 3D in the form of pseudo 3D votes. Second, we use a multi-tower architecture [wang2019makes] to balance features from both modalities, instead of simply training on the concatenated features.
Multi-modal fusion in learning.
How to fuse signals from multiple modalities is an open research problem in areas beyond 3D object detection. For example, the main focus of vision and language research is on developing more effective ways to jointly reason over visual data and text [fukui2016multimodal, perez2018film, yu2018beyond] for tasks like visual question answering [antol2015vqa, johnson2017clevr]. Another active area of research is video+sound [owens2016visually, gao20192], where the additional sound track can either provide a supervision signal [owens2016ambient] or pose interesting tasks that test joint understanding of both streams [zhao2018sound]. Targeting all such tasks, a recent gradient blending approach [wang2019makes] was proposed to make multi-modal networks more robust (to over-fitting and different convergence rates), which we adopt in our approach as well.
3 ImVoteNet Architecture
We design a 3D object detection solution suited for RGB-D scenes, based on the recently proposed deep Hough voting framework (VoteNet [voteNet]), by passing geometric and semantic/texture cues from 2D images to the voting process (as illustrated in Fig. 2). In this section, after a short summary of the original VoteNet pipeline, we describe how to build ‘2D votes’ with the assistance of 2D detectors on RGB, and explain how the 2D information is lifted to 3D and passed to the point cloud to improve the 3D voting and proposal. Finally, we describe our multi-tower architecture for fusing 2D and 3D detection with gradient blending [wang2019makes]. More implementation details are provided in the supplement.
3.1 Deep Hough Voting
VoteNet [voteNet] is a feed-forward network that consumes a 3D point cloud and outputs object proposals for 3D object detection. Inspired by the seminal work on the generalized Hough transform [ballard1981generalizing], VoteNet proposes an adaptation of the voting mechanism for object detection to a deep learning framework that is fully differentiable.
Specifically, it is comprised of a point cloud feature extraction module that enriches a subsampled set of scene points (called seeds) with high-dimensional features (bottom of Fig. 2, from input points to seeds). These features are then pushed through a Multi-Layer Perceptron (MLP) to generate votes. Every vote is both a point in 3D space, with its Euclidean coordinates (3-dim) supervised to be close to the object center, and a feature vector learned for the final detection task (F-dim). The votes form a clustered point cloud near object centers and are then processed by another point cloud network to generate object proposals and classification scores. This process is equivalent to the pipeline in Fig. 2 with just the point tower, without the image detection and fusion.
VoteNet recently achieved state-of-the-art results on indoor 3D object detection in RGB-D [voteNet]. Yet, it is solely based on point cloud inputs and neglects the image channels which, as we show in this work, are a very useful source of information. In ImVoteNet, we leverage the additional image information and propose a module that lifts 2D votes to 3D and improves detection performance. Next, we explain how to obtain 2D votes in images and how we lift their geometric cues to 3D together with semantic/texture cues.
3.2 Image Votes from 2D Detection
We generate image votes based on a set of candidate boxes from 2D detectors. An image vote, in its geometric part, is simply a vector connecting an image pixel to the center of the 2D object bounding box that pixel belongs to (see Fig. 1). Each image vote is also augmented with semantic and texture cues derived from the features of its source pixel, as in the fusion block in Fig. 2.
To form the set of boxes given an RGB image, we apply an off-the-shelf 2D detector (e.g. Faster R-CNN [ren2015faster]) pre-trained on the color channels of the RGB-D dataset. The detector outputs its most confident bounding boxes and their corresponding classes. We assign each pixel within a detected box a vote to the box center. Pixels inside multiple boxes are given multiple votes (the corresponding 3D seed points are duplicated for each of them), and those outside of any box are padded with zeros. Next we detail how we derive the geometric, semantic and texture cues.
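The per-pixel vote assignment just described can be sketched in a few lines. This is a minimal numpy sketch, not the paper's code: the function name, the `[xmin, ymin, xmax, ymax]` box format, and the choice of returning one vote map per box are assumptions.

```python
import numpy as np

def image_votes(boxes, height, width):
    """For each pixel inside a detected 2D box, compute the vector from
    that pixel to the box center; pixels outside any box get zero votes.
    `boxes` is an (N, 4) array of [xmin, ymin, xmax, ymax]. Returns one
    (H, W, 2) vote map per box, so pixels covered by several boxes keep
    one vote per box (mirroring the seed-point duplication in the text)."""
    vote_maps = []
    us, vs = np.meshgrid(np.arange(width), np.arange(height))
    for xmin, ymin, xmax, ymax in boxes:
        cu, cv = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        inside = (us >= xmin) & (us <= xmax) & (vs >= ymin) & (vs <= ymax)
        votes = np.zeros((height, width, 2))
        votes[..., 0] = np.where(inside, cu - us, 0.0)  # vote toward center, u
        votes[..., 1] = np.where(inside, cv - vs, 0.0)  # vote toward center, v
        vote_maps.append(votes)
    return vote_maps
```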
Geometric cues: lifting image votes to 3D
The translational 2D votes provide useful geometric cues for 3D object localization. Given the camera matrix, the 2D object center in the image plane becomes a ray in 3D space connecting the 3D object center and the camera optical center (Fig. 1). Adding this information to a seed point can effectively narrow down the 3D search space of the object center to 1D.
In detail, as shown in Fig. 3, given an object in 3D with its detected 2D bounding box in the image plane, we denote the 3D object center as C = (x2, y2, z2) and its projection onto the image as c = (u2, v2). A point P = (x1, y1, z1) on the object surface is associated with its projected point p = (u1, v1) in the image plane; hence, knowing the 2D vote v_2D = c − p to the 2D object center, we can reduce the search space for the 3D center to a 1D position on the ray Oc (with O the camera optical center). We now derive the computation we follow to pass this ray information to a 3D seed point. Defining P in the camera coordinate system, and p, c in the image plane coordinate system, we seek to recover the 3D object center C (the voting target for the 3D point P). The true 3D vote from P to C is:

v = C − P = (x2 − x1, y2 − y1, z2 − z1).
The 2D vote, assuming a simple pin-hole camera (see the supplementary for details on handling a general camera model and camera-to-world transformations) with focal length f, can be written as:

v_2D = (u2 − u1, v2 − v1) = (f·x2/z2 − f·x1/z1, f·y2/z2 − f·y1/z1).
We further assume that the depth z2 of the object center C is similar to the depth z1 of the surface point P. This is a reasonable assumption for most objects when they are not too close to the camera. Then, given v_2D = (Δu, Δv), we compute

v' = (z1·Δu/f, z1·Δv/f, 0),

which we refer to as a pseudo 3D vote, as the pseudo center C' = P + v' lies on the ray OC and is in the proximity of C. This pseudo 3D vote provides information about where the 3D center is relative to the surface point P.
To compensate for the error caused by the depth approximation (z2 ≈ z1), we pass the ray direction as extra information to the 3D surface point. The error caused by the approximated depth, after some derivation, can be expressed as

C − C' = (z2 − z1)·(u2/f, v2/f, 1).

Hence, if we input the direction of the ray OC, namely (u2/f, v2/f, 1), the network has enough information to estimate the true 3D vote by estimating the depth difference z2 − z1. As we do not know the true 3D object center C, we use the ray direction of OC' instead, which aligns with OC, where

C' = P + v' = (z1·u2/f, z1·v2/f, z1).
Normalizing the ray direction and concatenating it with the pseudo vote, the image geometric features we pass to the seed point are the 5-dim vector

[ z1·Δu/f, z1·Δv/f, C'/‖C'‖ ],

i.e., the two in-plane components of the pseudo 3D vote followed by the 3-dim unit ray direction.
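The lifting step above can be sketched as follows. This is a hedged numpy sketch under the pin-hole model of the derivation (image coordinates centered at the principal point); the function name and argument layout are illustrative, not the paper's implementation.

```python
import numpy as np

def lift_2d_vote(p_3d, vote_2d, f):
    """Lift a 2D image vote to the 5-dim geometric cue of Sec. 3.2.
    `p_3d` = (x1, y1, z1) is the surface point in camera coordinates,
    `vote_2d` = (du, dv) is the pixel-space vote, `f` the focal length.
    Assumes the object center has roughly the surface point's depth."""
    x1, y1, z1 = p_3d
    du, dv = vote_2d
    # pseudo 3D vote: z-component is 0 under the equal-depth assumption
    pseudo_vote = np.array([z1 * du / f, z1 * dv / f, 0.0])
    # pseudo center C' = P + v' lies on the ray from the camera through C
    c_prime = np.asarray(p_3d) + pseudo_vote
    ray_dir = c_prime / np.linalg.norm(c_prime)
    # 2 dims for the in-plane pseudo vote, 3 for the unit ray direction
    return np.concatenate([pseudo_vote[:2], ray_dir])
```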
Semantic cues: region-level image features
On top of the geometric features just discussed, which use only the spatial coordinates of the bounding boxes, an important type of information RGB can provide is features that convey a semantic understanding of what is inside the box. This information often complements what can be learned from 3D point clouds and can help to distinguish between classes that are geometrically very similar (such as table vs. desk or nightstand vs. dresser).
In light of this, we provide additional region-level features extracted per bounding box as semantic cues for 3D points. For all the 3D seed points that are projected within a 2D box, we pass a vector representing that box to the point. If a 3D seed point falls into more than one 2D box (i.e., when boxes overlap), we duplicate the seed point for each of the overlapping 2D regions (up to a fixed maximum number). If a seed point is not projected to any 2D box, we simply pass an all-zero feature vector for padding.
It is important to note that the ‘region features’ here include but are not limited to features extracted from RoI pooling operations [ren2015faster]. In fact, we find that representing each box with a simple one-hot class vector (with a confidence score for that class) is already sufficient to cover the semantic information needed for disambiguation in 3D. It not only gives a light-weight input (e.g. 10-dim [sung2015data] vs. 1024-dim [lin2017feature]) that performs well, but also generalizes to all other competitive (e.g. faster) 2D detectors [redmon2016you, liu2016ssd, lin2017focal] that do not explicitly use RoI pooling but directly output classification scores. Therefore, we use this semantic cue by default.
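The default semantic cue is simple enough to state in code. A minimal sketch (the function name and the default of 10 classes, matching the 10 SUN RGB-D categories evaluated later, are assumptions):

```python
import numpy as np

def semantic_cue(class_id, score, num_classes=10):
    """Region-level semantic cue: a one-hot vector over the detector's
    classes carrying the detection confidence at the predicted class.
    Seed points outside every 2D box would instead receive all zeros."""
    cue = np.zeros(num_classes)
    cue[class_id] = score
    return cue
```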
Texture cues: per-pixel image features
Different from the depth information, which spreads sparsely in 3D space, RGB images capture high-resolution signals at a dense, per-pixel level in 2D. While region features can offer a high-level, semantic-rich representation per bounding box, it is complementary and equally important to use low-level, texture-rich representations as another type of cue. Such cues can be passed to the 3D seed points via a simple mapping: a seed point gets pixel features from the corresponding pixel of its 2D projection (if the coordinates after projection are fractional, bilinear interpolation is used).
Although any learned, convolutional feature maps with spatial dimensions (height and width) can serve our purpose, by default we still use the simplest texture feature by feeding in the raw RGB pixel-values directly. Again, this choice is not only light-weight, but also makes our pipeline independent of 2D networks.
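The projection-plus-interpolation mapping can be sketched as follows. This is a hedged numpy sketch assuming a standard 3x3 intrinsic matrix `K` and an (H, W, 3) image; the function name is illustrative and boundary handling is omitted for brevity.

```python
import numpy as np

def pixel_texture_cue(point, K, image):
    """Texture cue for one 3D seed point: project it with intrinsics
    `K` and bilinearly interpolate the RGB image at the (generally
    fractional) projected location."""
    x, y, z = point
    u = K[0, 0] * x / z + K[0, 2]  # pin-hole projection, u axis
    v = K[1, 1] * y / z + K[1, 2]  # pin-hole projection, v axis
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    # bilinear blend of the four neighboring pixels
    return ((1 - du) * (1 - dv) * image[v0, u0]
            + du * (1 - dv) * image[v0, u0 + 1]
            + (1 - du) * dv * image[v0 + 1, u0]
            + du * dv * image[v0 + 1, u0 + 1])
```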
Experimentally, we show that even with such minimalist choice of both our semantic and texture cues, significant performance boost over geometric-only VoteNet can be achieved with our multi-tower training paradigm, which we discuss next.
3.3 Feature Fusion and Multi-tower Training
With lifted image votes and their corresponding semantic and texture cues (in the fusion block in Fig. 2), as well as the point cloud features of the seed points, each seed point can generate 3D votes and aggregate them to propose 3D bounding boxes (through a voting and proposal module similar to that in [voteNet]). Yet it takes extra care to optimize the deep network to fully utilize cues from all modalities. As a recent paper [wang2019makes] notes, without a careful strategy, multi-modal training can actually result in degraded performance compared to single-modality training. The reason is that different modalities may learn to solve the task at different rates, so, without attention, certain features may dominate the learning and result in over-fitting. In this work, we follow the gradient blending strategy introduced in [wang2019makes] to weight the gradients for the different modality towers (by weighting the loss functions).
In our multi-tower formulation, as shown in Fig. 2, we have three towers taking seed points with three sets of features: point cloud features only, image features only, and joint features. Each tower has the same target task of detecting 3D objects – but they each have their separate 3D voting and box proposal network parameters as well as their separate losses. The final training loss is the weighted sum of the three detection losses:

L = w_point · L_point + w_image · L_image + w_joint · L_joint.
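The blended loss is just a weighted sum of the per-tower losses. A minimal sketch (the tower names and the weight values used below are placeholders, not the paper's tuned hyperparameters):

```python
def blended_loss(loss_point, loss_image, loss_joint,
                 w_point, w_image, w_joint):
    """Gradient blending via loss weighting: each tower keeps its own
    detection loss, and the weights control how strongly each modality's
    gradients drive the shared features."""
    return w_point * loss_point + w_image * loss_image + w_joint * loss_joint
```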
Within the image tower, although image features alone cannot localize 3D objects, we have leveraged surface point geometry and the camera intrinsics to obtain pseudo 3D votes that are good approximations of the true 3D votes. So, combining this image geometric cue with the other semantic/texture cues, we can still localize objects in 3D with image features only.
Note that, although the multi-tower structure introduces extra parameters, at inference time we no longer need to compute the point-cloud-only and image-only towers – therefore there is minimal computation overhead.
4 Experiments
In this section, we first compare our model with previous state-of-the-art methods on the challenging SUN RGB-D dataset (Sec. 4.1). Next, we provide visualizations of detection results showing how image information helps boost 3D recognition (Sec. 4.2). Then, we present an extensive set of analytical experiments to validate our design choices (Sec. 4.3). Finally, we test our method under conditions of very sparse depth, and demonstrate its robustness in such scenarios (Sec. 4.4).
4.1 Comparing with State-of-the-art Methods
We use SUN RGB-D [silberman2012indoor, janoch2013category, xiao2013sun3d, song2015sun] as our benchmark for evaluation, a single-view RGB-D dataset for 3D scene understanding. (We do not evaluate on the ScanNet dataset [dai2017scannet] as in VoteNet, because ScanNet involves multiple 2D views for each reconstructed scene and thus requires extra handling to merge multi-view features.) It consists of 10K RGB-D images, with 5K for training. Each image is annotated with amodal oriented 3D bounding boxes. In total, 37 object categories are annotated. Following the standard evaluation protocol [song2016deep], we only train and report results on the 10 most common categories. To feed the data to the point cloud backbone network, we convert the depth images to point clouds using the provided camera parameters. The RGB image is aligned to the depth channel and is used to query corresponding image regions for scene 3D points.
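The depth-to-point-cloud conversion mentioned above is standard pin-hole back-projection. A minimal numpy sketch (function name and the 3x3 intrinsic-matrix convention are assumptions; SUN RGB-D additionally provides extrinsics, omitted here):

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project an (H, W) depth map to an (H*W, 3) point cloud in
    camera coordinates using the intrinsic matrix `K`."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (us.ravel() - K[0, 2]) * z / K[0, 0]  # invert u = f*x/z + cx
    y = (vs.ravel() - K[1, 2]) * z / K[1, 1]  # invert v = f*y/z + cy
    return np.stack([x, y, z], axis=1)
```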
Methods in comparison.
We compare ImVoteNet with previous methods that use both geometry and RGB. Moreover, since the previous state of the art (VoteNet [voteNet]) used only geometric information, to better appreciate the improvement due to our proposed fusion and gradient blending modules, we add two more strong baselines by extending the basic VoteNet with additional features from images.
Among the previous methods designed for RGB-D, 2D-driven [lahoud20172d], PointFusion [xu2018pointfusion] and F-PointNet [qi2018frustum] are all cascaded systems that rely on 2D detectors to provide proposals for 3D. Deep Sliding Shapes [song2016deep] designs a Faster R-CNN [ren2015faster] style 3D CNN network to generate 3D proposals from voxel input and then combines 3D and 2D RoI features for box regression and classification. COG [ren2016three] is a sliding shape based detector using 3D HoG like feature extracted from RGB-D data.
As for the variations of VoteNet [voteNet], the first one, ‘+RGB’, directly appends the RGB values as a three-dimensional vector to the point cloud features (of the seed points). For the second one (‘+region feature’), we use the same pre-trained Faster R-CNN (as in our model) to obtain the region-level one-hot class confidence feature, and concatenate it to the seed points inside that 2D box frustum. These two variations can also be viewed as ablated versions of our method.
Table 1 shows the per-class 3D object detection results on SUN RGB-D. We can see that our model outperforms all previous methods by large margins. In particular, it improves upon the previously best model, VoteNet, by 5.7 mAP, showing the effectiveness of the lifted 2D image votes. It gets better results on nearly all categories and has the biggest improvements on object categories that are often occluded (+12.5 AP for bookshelves) or geometrically similar to others (+11.6 AP for dressers and +7.7 AP for nightstands).
Compared to the variations of VoteNet that also use RGB data, our method likewise shows significant advantages. In fact, we find that naively appending RGB values to the point features results in worse performance, likely due to over-fitting on the RGB values. Adding region features as a one-hot score vector helps a bit, but is still inferior to our method, which leverages image votes more systematically.
4.2 Qualitative Results and Discussion
In Fig. 4, we highlight detection results of both the original VoteNet [voteNet] (with only point cloud input) and our ImVoteNet with point cloud plus image input, to show how image information can help 3D detection in various ways. The first example shows how 2D object localization and semantics help. We see a cluttered bookshelf that was missed by VoteNet, but thanks to the 2D detection in the images, our network has enough confidence to recognize it. The image semantics also help our network avoid the false positive chair present in the VoteNet output (a coffee table and candles confused the network there). The second example shows how images can compensate for depth sensor limitations. Due to the color and material of the black sofa, there are barely any depth points captured for it. While VoteNet completely misses the sofa, our network is able to pick it up. The third example shows how image cues can push the limit of 3D detection performance, by recovering far away objects (the desk and chairs in the back) that are even missed in the ground truth annotations.
4.3 Analysis Experiments
In this subsection, we show extensive ablation studies on our design choices and discuss how different modules affect the model performance. For all experiments we report mAP@0.25 on SUN RGB-D as before.
Analysis on geometric cues.
To validate that the geometric cues lifted from 2D votes help, we ablate the geometric features (as in Eq. 6) passed to the 3D seed points in Table 2(a). Comparing row 1 to row 3, not using any 2D geometric cue results in a 2.2 point drop. On the other hand, not using the ray angle results in a 1.2 point drop, indicating the ray angle provides a corrective cue for the pseudo 3D votes.
Analysis on semantic cues.
Table 2(b) shows how different types of region features from the 2D images affect 3D detection performance. We see that the one-hot class score vector (probability score for the detected class, other classes set to 0), though simple, leads to the best result. Directly using the 1024-dim RoI features from the Faster R-CNN network actually gives the worst result, likely due to the optimization challenge of fusing this high-dimensional feature with the rest of the point features. Reducing the 1024-dim feature to 64-dim helps, but is still inferior to the simple one-hot score feature.
Analysis on texture cues.
Table 2(c) shows how different low-level image features (texture features) affect the end detection performance. It is clear that the raw RGB features are already effective, while the more sophisticated per-pixel CNN features (from the feature pyramids [lin2017feature] of the Faster R-CNN detector) actually hurt, probably due to over-fitting. More details are in the supplementary material.
Analysis on tower weights.
Table 3 studies how the tower weights affect gradient blending training. We ablate with a few representative sets of weights, ranging from single-tower training (the first row), to dominating weights for each of the towers (2nd to 4th rows), to our best setup. It is interesting to note that even with just the image features (the 1st row, 4th column), i.e. the pseudo votes and semantic/texture cues from the images, we can already outperform several previous methods (see Table 1), showing the power of our fusion and voting design.
4.4 Detection with Sparse Point Clouds
While depth images provide dense point clouds for a scene (usually 10k to 100k points), there are other scenarios where only sparse points are available. One example is when the point cloud is computed through visual odometry [nister2004visual] or Structure from Motion (SfM) [koenderink1991affine], where 3D point positions are triangulated by estimating the poses of a monocular camera across multiple views. With such sparse data, it is valuable to have a system that can still achieve decent detection performance.
To analyze the potential of our model with sparse point clouds, we simulate scans with far fewer points through two types of point sub-sampling: uniformly random sub-sampling (removing existing points with a uniform distribution) and ORB [rublee2011orb] key-point based sub-sampling (sampling ORB key points on the image and keeping only 3D points that project close to those 2D key points). In Table 5, we present detection results with different distributions and densities of point cloud input. We see that in the “point cloud” column, 3D detection performance quickly drops as the number of points decreases. On the other hand, including image cues significantly improves performance. This improvement is most significant when the sampled points come from ORB key points, which are more non-uniformly distributed.
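The first sparsity setting is straightforward to reproduce. A minimal numpy sketch (function name and fixed seed are illustrative; the ORB variant would additionally require an image feature detector such as OpenCV's ORB, omitted here):

```python
import numpy as np

def uniform_subsample(points, num_keep, seed=0):
    """Simulate a sparse scan by uniformly random sub-sampling an (N, D)
    point array without replacement, as in the uniform setting of
    Sec. 4.4."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=num_keep, replace=False)
    return points[idx]
```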
Table 5: mAP under different point cloud settings (sampling method and number of points), comparing the point-cloud-only tower with the joint model.
5 Conclusion
In this work we have explored how image data can assist a voting-based 3D detection pipeline. The VoteNet detector we build upon relies on a voting mechanism to effectively aggregate geometric information in point clouds. We have demonstrated that our new network, ImVoteNet, can leverage extant image detectors to provide both geometric and semantic/texture information about an object in a format that can be integrated into the 3D voting pipeline. Specifically, we have shown how to lift 2D geometric information to 3D, using knowledge of the camera parameters and pixel depth. ImVoteNet significantly boosts 3D object detection performance by exploiting multi-modal training with gradient blending, especially in settings where the point cloud is sparse or unfavorably distributed.
Appendix A Overview
Appendix B Details on ImVoteNet Architecture
In this section, we explain the details of the ImVoteNet architecture. Sec. B.1 provides details of the point cloud deep net as well as the training procedure. Further details on the 2D detector and 2D votes are described in Sec. B.2, while details on lifting 2D votes with general camera parameters are described in Sec. B.3.
B.1 Point Cloud Network
Input and data augmentation.
The point cloud backbone network takes a randomly sampled point cloud of a SUN RGB-D [song2015sun] depth image. Each point has its XYZ coordinate as well as its height (distance to the floor). The floor height is estimated as the 1st percentile of the heights of all points. Similar to [voteNet], we augment the input point cloud by randomly sub-sampling the points from the depth image points on-the-fly. Points are also randomly flipped in the horizontal direction, randomly rotated along the up-axis by Uniform[-30, 30] degrees, and randomly scaled by Uniform[0.85, 1.15]. Note that the point height and the camera extrinsics are updated accordingly with the augmentation.
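The augmentation described above can be sketched as follows. This is a simplified numpy sketch: it assumes (N, 3) points with the up-axis as the last column and omits the height-feature and extrinsics updates mentioned in the text.

```python
import numpy as np

def augment(points, rng):
    """Random left-right flip, up-axis rotation by Uniform[-30, 30]
    degrees, and global scaling by Uniform[0.85, 1.15], applied to an
    (N, 3) point array."""
    if rng.random() < 0.5:
        points[:, 0] = -points[:, 0]          # horizontal flip
    theta = np.deg2rad(rng.uniform(-30, 30))  # rotation about the up-axis
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    points = points @ rot.T
    points *= rng.uniform(0.85, 1.15)         # global scale
    return points
```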
We adopt the same PointNet++ [qi2017pointnetplusplus] backbone network as in [voteNet], with four set abstraction (SA) layers and two feature propagation/upsampling (FP) layers. The output of the backbone network is a set of seed points, each with a 3-dim XYZ coordinate and a 256-dim feature vector.
As for voting, different from VoteNet, which directly predicts votes from the seed points, here we fuse the lifted image votes with the seed points before voting. As each seed point can fall into multiple 2D detection boxes, we duplicate a seed point once per box if it falls in overlapping boxes. Each duplicated seed point has its feature augmented with a concatenation of the following image vote features: 5-dim lifted geometric cues (2 for the vote and 3 for the ray angle), 10-dim (per-class) semantic cues and 3-dim texture cues. In the end, the fused seed point has a 3-dim coordinate and a 274-dim feature vector.
The voting layer takes a seed point and maps its features to votes through a multi-layer perceptron (MLP) with FC output sizes of 256, 256 and 259, where the last FC layer outputs the XYZ offset (3 dims) and feature residuals (256 dims, with regard to the 256-dim seed feature) for the votes. As in [voteNet], the proposal module is another set abstraction layer that takes in the generated votes and generates proposals of shape $K \times 79$, where $K$ is the number of total duplicated seed points and the 79-dim output consists of 2 objectness scores, 3 center regression values, 24 numbers for heading regression (12 heading bins), 40 numbers for box size regression (10 box anchors) and 10 numbers for semantic classification.
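The 259-dim output of the voting MLP splits into an XYZ offset and a feature residual, which are added to the seed to form the vote. A minimal sketch of this parsing step (illustrative names, the MLP itself omitted):

```python
import numpy as np

def split_vote_output(mlp_out, seed_xyz, seed_feat):
    """Parse the 259-dim voting-MLP output into votes.

    mlp_out:  (M, 259) last-FC output: 3 XYZ offsets + 256 feature residuals
    seed_xyz: (M, 3);  seed_feat: (M, 256)
    """
    xyz_offset, feat_residual = mlp_out[:, :3], mlp_out[:, 3:]
    vote_xyz = seed_xyz + xyz_offset
    vote_feat = seed_feat + feat_residual  # residual w.r.t. the seed feature
    return vote_xyz, vote_feat
```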
We pre-train the 2D detector as described in Sec. B.2 and use the extracted image votes as extra input to the point cloud network. We train the point cloud deep net with the Adam optimizer, a batch size of 8 and an initial learning rate of 0.001. The learning rate is decayed by 10× after 80 epochs and then decayed by another 10× after 120 epochs. Finally, training stops at 140 epochs, as we find further training does not improve performance.
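The step schedule above can be written compactly; the 10× decay factors are our reading of the text:

```python
def learning_rate(epoch, base_lr=0.001):
    """Step learning-rate schedule: 10x decay after epochs 80 and 120,
    with training ending at epoch 140."""
    lr = base_lr
    if epoch >= 80:
        lr *= 0.1
    if epoch >= 120:
        lr *= 0.1
    return lr
```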
B.2 2D Detector and 2D Cues
2D detector training.
While ImVoteNet can work with any 2D detector, in this paper we choose Faster R-CNN [ren2015faster], currently the dominant framework for bounding box detection in RGB. The detector we use has a basic ResNet-50 [he2016deep] backbone with a Feature Pyramid Network (FPN) [lin2017feature] constructed over levels $P_2$ through $P_6$. It is pre-trained on the COCO train2017 dataset [lin2014microsoft], achieving a val2017 AP of 41.0. To adapt the COCO detector to 2D detection on the target dataset, we further fine-tune the model on ground truth 2D boxes from the training set of SUN RGB-D, using only the color channels. The fine-tuning lasts for 4K iterations, with the learning rate starting from 0.01 and reduced by 10× at the 3K-th iteration. The batch size, weight decay and momentum are set to 8, 1e-4 and 0.9, respectively. Two data augmentation techniques are used: 1) standard left-right flipping; and 2) scale augmentation by randomly sampling the shorter side of the input image from [480, 600]. The resulting detector achieves a mAP (at overlap 0.5) of 58.5 on the val set.
Note that we specifically choose not to use the most advanced 2D detectors (e.g. ones based on ResNet-152 [he2016deep]) just for the sake of performance improvement. As our experimental results in the main paper show, even with this simple baseline Faster R-CNN we already see a significant boost, thanks to the design of ImVoteNet.
To infer 2D boxes with the detector, we first resize the input image to a shorter side of 600 before feeding it into the model. The top 100 detection boxes across all classes are then aggregated for each image. We further reduce the number of 2D boxes per image by filtering out any detection with a confidence score below 0.1. Two things to note about the 2D boxes used while training ImVoteNet: 1) we could also train with ground truth 2D boxes, but we empirically found that including them during training hurts performance, likely due to the different detection statistics at test time; 2) as the pre-training of the 2D detector is performed on the same training set, it generally gives better detection results on SUN RGB-D train set images; to reduce the effect of this over-fitting, we randomly drop 2D boxes with a probability of 0.5.
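The box post-processing described above (top-100 aggregation, score thresholding, and the training-time random drop) can be sketched as follows; function and parameter names are ours:

```python
def filter_boxes(boxes, scores, top_k=100, score_thresh=0.1,
                 train_drop_prob=0.5, training=False, rng=None):
    """Sketch of the 2D-box post-processing.

    Keeps the top_k highest-scoring boxes across all classes, removes
    detections scoring below score_thresh, and (only while training
    ImVoteNet) randomly drops each remaining box with probability
    train_drop_prob to reduce over-fitting to the 2D detector.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]
    keep = [i for i in order if scores[i] >= score_thresh]
    if training and rng is not None:
        keep = [i for i in keep if rng.random() >= train_drop_prob]
    return [boxes[i] for i in keep]
```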
Alternative semantic cues.
Other than the default semantic cue, which represents each 2D box region as a one-hot classification score vector (the detected class takes the value of the confidence score from the detector, all other entries are zero), we further experimented with dense RoI features extracted from that region. Two variants are reported in the paper, with the 1024-dim one being the output of the last FC layer before region classification and regression. For the 64-dim one, we insert an additional FC layer before the final output layers so that the region information is compressed into a 64-dim vector. The added layer is pre-trained with the 2D detector (resulting in a val mAP of 57.9) and kept fixed when training ImVoteNet.
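The default semantic cue is straightforward to construct; a minimal sketch (with the 10 SUN RGB-D classes assumed as the vector length):

```python
import numpy as np

def semantic_cue(class_id, confidence, num_classes=10):
    """Default semantic cue for a 2D box: a per-class score vector holding
    the detector confidence at the detected class, zeros elsewhere."""
    cue = np.zeros(num_classes, dtype=np.float32)
    cue[class_id] = confidence
    return cue
```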
Alternative texture cues.
The default texture cue is the raw RGB values (normalized to [-1, 1]). Besides this simple texture cue, we also experimented with more advanced per-pixel features. One handy feature that preserves such spatial information is the set of feature maps $P_l$ from FPN, which fuse top-down and lateral connections [lin2017feature]. Here $l$ is the index of the layer in the feature pyramid, which also designates the feature stride and size. For example, $P_2$ has a stride of $2^2 = 4$ for both height and width, and a spatial size of roughly 1/16 of the input image (note that, different from 2D box detection, we feed the images directly to the model without resizing to a shorter side of 600 when computing FPN features). For $P_3$, the stride is $2^3 = 8$. All feature maps have a channel size of 256, which becomes the input dimension when they are used as texture cues for ImVoteNet.
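Looking up a per-pixel texture cue from a pyramid level then amounts to dividing the pixel coordinates by the level's stride. A sketch with nearest-neighbor lookup for brevity (bilinear sampling is an obvious alternative); names are ours:

```python
import numpy as np

def texture_cue_from_fpn(feature_map, level, u, v):
    """Look up a per-pixel texture cue from FPN level P_l.

    feature_map: (C, H, W) features of pyramid level `level` (C = 256
                 in the paper's setting)
    level:       pyramid index l; the stride is 2**l (P2 -> 4, P3 -> 8)
    (u, v):      pixel coordinates in the full-resolution image
    """
    stride = 2 ** level
    x = min(int(u / stride), feature_map.shape[2] - 1)
    y = min(int(v / stride), feature_map.shape[1] - 1)
    return feature_map[:, y, x]  # C-dim texture cue for this pixel
```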
B.3 Image Votes Lifting
In the main paper we derived the lifting process that transforms a 2D image vote to a 3D pseudo vote without considering the camera extrinsics. As the point cloud sampled from the depth image is transformed to the upright coordinate before being fed to the point cloud network (through the camera extrinsics, a rotation matrix), the 3D pseudo vote also needs to be transformed to the same coordinate.
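The transform itself is a single rotation; a minimal sketch (the rotation matrix here is an arbitrary example, not the SUN RGB-D extrinsics):

```python
import numpy as np

def to_upright(points_cam, R_cam_to_upright):
    """Transform camera-frame points (e.g. the end point of a 3D pseudo
    vote) to the upright coordinate via the extrinsic rotation matrix."""
    return points_cam @ np.asarray(R_cam_to_upright).T
```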
Fig. 5 shows the surface point, the object center and the end point of the pseudo vote. Since the point cloud is in the upright coordinate, the point cloud deep net can only estimate the depth displacement of the center and the vote end point in the upright coordinate (it cannot estimate the depth displacement along the camera viewing direction, as the rotation angles from camera to upright coordinate are unknown to the network). Therefore, we need to calculate a new pseudo vote whose end point lies on the camera ray through the original end point, with the offset between the two end points perpendicular to that ray.
To calculate the new pseudo vote, we first transform the surface point and the original vote end point to the upright coordinate; with both expressed in the upright coordinate, the new end point can then be computed from the ray and perpendicularity constraints.
Table 5: Sampled sparse point clouds projected onto images from the SUN RGB-D [song2015sun] train set, under different point cloud settings (sampling method and number of points).
Appendix C Visualization of Sparse Points
In Sec. 4.4 of the main paper we showed how image information and the ImVoteNet model can be especially helpful for detection with sparse point clouds. Here, in Table 5, we visualize the sampled sparse point clouds on three example SUN RGB-D images. We project the sampled points onto the RGB images to show their distribution and density. The points in the first row have a dense and uniform coverage of the entire scene. After randomly subsampling to fewer points in the second and third rows, the coverage is much sparser but still uniform. In contrast, the ORB key-point-based sampling (the last two rows) results in a very uneven distribution of points, clustered around corners and edges. The non-uniformity and low coverage of the ORB key points make it especially difficult to recognize objects from the point cloud alone. That is also where our ImVoteNet model shows the most significant improvement over VoteNet.