Deriving a 3D map with high-level semantics is key for intelligent navigation systems to interact with humans in the environment. A significant research effort has been invested in 3D classification and segmentation tasks. ScanNet [dai2017scannet] and Matterport [chang2017matterport3d] collect large-scale RGB-D video datasets and provide semantic and instance annotations for 3D point clouds. Many subsequent works [dai20183dmv, wang2018sgpn, tchapmi2017segcloud, qi2017pointnet, qi2017pointnet++, su2018splatnet, liang2018deep, graham20183d, graham2015sparse] investigate various deep learning algorithms for 3D scenarios. Su et al. [su2015multi] obtain 3D representations by applying CNNs to 2D rendered images and aggregating multi-view features for 3D classification tasks. Leveraging the success of CNNs on the image pixel grid, 3D voxel networks, such as [zhou2017voxelnet, maturana2015voxnet, graham20183d, graham2015sparse], learn spatial relationships from discretized space with 3D convolution kernels. However, voxelization may introduce quantization artifacts, which limits its utility to low-resolution point cloud data. Point-based networks [qi2017pointnet, qi2017pointnet++] are proposed to alleviate this problem by directly processing the input points, which represent geometry more efficiently and adapt flexibly to different data formats such as depth sensor and lidar data. However, most of them use only geometric features without considering features from other modalities, e.g., image features. In addition, global context has been shown to be effective for 2D scene parsing tasks [ParseNet, zhao2017pyramid], but it has not yet been investigated in recent 3D architectures.
We propose a unified point-based framework for 3D point cloud segmentation, as shown in Figure 2, that effectively leverages 2D pixel-level image features, 3D point-level structures, and global context priors within a scene. The experimental results show that 2D, 3D, and global context features benefit each other (Table 3). To mitigate wrongly estimated camera poses from structure from motion, we explore synthetic camera poses in 3D scenes. The results show that synthetic camera pose sampling further improves our performance on the ScanNet testing set (Table 2). Our unified framework demonstrates superior performance over several state-of-the-art methods, including 3DMV [dai20183dmv] and SplatNet [su2018splatnet] (Table 2).
The main contributions of this work are summarized as follows:
We propose to effectively integrate 2D image features, geometric structures and global context priors within an entire scene into a unified point-based framework, which is shown experimentally to outperform several state-of-the-art approaches on the ScanNet benchmark [ScanNetBenchmark].
We provide an in-depth analysis of various design choices (e.g., point features, sub-volume strides, synthetic camera models) of our framework to achieve better performance.
Through experiments on different feature combinations, we demonstrate that semantic segmentation is improved by textural, geometric and global context information.
2 Related Work
Some works [wang2018sgpn, liu2018floornet, liang2018deep, su2018splatnet, dai20183dmv] leverage both structural and textural information for different 3D tasks. 3D semantic segmentation methods can be categorized into image-based [hermans2014dense, mccormac2017semanticfusion], voxel-based [dai2017scannet, dai20183dmv, graham20183d], point-based [qi2017pointnet, qi2017pointnet++] and joint fusion methods [su2018splatnet, dai20183dmv, wang2018sgpn]. We briefly review these approaches and highlight the main differences from our work.
Hermans et al. [hermans2014dense] propose a fast 2D semantic segmentation approach based on randomized decision forests and integrate the semantic segmentation into the 3D reconstruction pipeline. McCormac et al. [mccormac2017semanticfusion] propose a SLAM system that incorporates a CNN to obtain a 3D semantic map. However, the above methods generate their predictions purely from 2D images without fully utilizing 3D cues from geometry. Our method aims at leveraging all 2D and 3D cues within the scenes.
Voxel-based networks consist of a series of 3D convolutional kernels that learn from the input voxelized data [maturana2015voxnet]. Dai et al. [dai2017scannet] propose a voxel-based network for 3D indoor scene semantic segmentation. The network takes a sub-volume as input and predicts the class probabilities of the central column. However, spatial redundancy occurs in the voxelized data as many voxels remain unoccupied. Sparse Convolutional Neural Networks [graham20183d, graham2015sparse] are proposed to handle the data sparsity by applying kernels on the submanifold area. They reduce the computation cost and enable deeper 3D ConvNets with high performance on the ScanNet benchmark [ScanNetBenchmark]. However, pre-processing effort is needed to transform 3D point cloud data into a voxel representation, and the performance of 3D ConvNets relies on the voxel resolution. Our method works directly on the mesh vertices of a 3D scene without the voxelization pre-processing step, which avoids tuning the voxel resolution.
PointNet [qi2017pointnet] is a pioneer in this direction. The authors propose a permutation-invariant network with a symmetric function to handle unordered point sets. The network applies a set of multi-layer perceptron (MLP) networks to each point and aggregates all the point features through a max-pooling layer. PointNet++ [qi2017pointnet++] improves PointNet by proposing a hierarchical neural network that captures the fine geometric structures of small neighborhoods. However, current point-based frameworks do not utilize the information from 2D image features, which is critical for regions lacking explicit structure, such as discriminating a painting from a wall (cf. Figure 5).
Several works address 2D-3D fusion for various tasks. Liang et al. [liang2018deep] target 3D object detection. To address the sparsity of the bird’s eye view (BEV), they retrieve 2D features as well as 3D features to produce a dense 2D BEV feature map. Liu et al. [liu2018floornet] fuse 2D and 3D features to produce semantic 2D floorplans. In contrast, we focus on developing an algorithm to produce 3D semantic maps. Dai et al. [dai20183dmv] extract 2D image features from aligned RGB images and back-project the image features into a voxel volume. Image and 3D geometry streams are jointly fused to predict 3D voxel labels. However, their method adopts a volumetric architecture, which loses the input resolution and produces spatial redundancy caused by voxelization. Su et al. [su2018splatnet] project 2D and 3D features into a permutohedral lattice and apply sparse convolutions over this sparsely populated lattice. Their projection of the lattice features into the permutohedral lattice space loses one dimension of structural information. The sizes of the lattice space are controlled by scaling matrices, which introduce more hyper-parameters as well as quantization errors during the splat and slice steps. Wang et al. [wang2018sgpn] target 3D instance segmentation; they extract 2D-3D features from a single 2.5D (RGBD) image and fuse them at the point level. However, only partial object surfaces can be observed from a single RGBD image. Compared with the aforementioned methods, our approach is a more generic 3D approach that handles entire 3D scenes and effectively leverages 2D image features, geometric structures and global context priors at the point level within an entire scene.
3 Proposed Framework
Given an input 3D mesh and a set of camera poses, which can be estimated using a Structure from Motion system (SfM) or obtained from synthesized camera trajectories, the goal of our framework is to produce semantic labels for the input 3D point clouds (i.e., vertices on the 3D mesh). Note that our framework is general and applicable to any 3D point clouds and 2D image pairs, not limited to 3D meshes.
Our framework, shown in Figure 2, consists of four parts: (1) We apply a 2D CNN to extract appearance features from rendered images and back-project the features into 3D coordinates. The 2D features are interpolated and concatenated with 3D point features as inputs for the 3D point-based networks; (2) Locally, a sub-volume encoder extracts fine local details in a target 3D sub-volume; (3) Globally, a global context encoder extracts global scene priors from a sparsely sampled scene point set; (4) The decoder aggregates all information (2D image features, local features and global scene context) and produces the semantic labels for each point in the sub-volume.
3.1 Image Features to 3D Vertices
To obtain fine-grained appearance features from a complex scene, we render a 3D mesh into 2D images from a series of camera poses and extract 2D image features by applying a 2D segmentation network. The textural features are back-projected into 3D space using the camera poses.
3.1.1 Image Feature Extraction
To extract pixel features from color images, we use DeepLab [chen2018deeplab], pre-trained on ADE20K [zhou2017scene], as our 2D segmentation network. We render color images and ground truth labels from ScanNet 3D meshes to fine-tune DeepLab [chen2018deeplab]. We use the layer before the output layer, which produces a 256-channel feature map at a fraction of the input image size, as our feature descriptor. We up-sample the feature map to the original image scale to obtain pixel-level features and back-project the pixelwise 256-dimensional features to the 3D mesh vertices.
3.1.2 Pixel-vertex Association
Figure 3 illustrates how we associate the 2D image pixels with global 3D points. Given the camera extrinsic parameters and the camera intrinsic parameters, we back-project each pixel from the image coordinate to the 3D world coordinate, which gives the 3D coordinate of each pixel. We then associate the pixel with the mesh triangle in which this 3D coordinate lies. The world coordinates of the three vertices that compose the triangle are used to compute barycentric weights.
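As an illustration of the back-projection step, the following minimal sketch maps one pixel to world coordinates. It assumes a pinhole camera, a per-pixel depth value (e.g., taken from the renderer's depth buffer), and extrinsics stored as a 4x4 camera-to-world matrix; the function name `backproject_pixel` and these conventions are assumptions made for the example, not details stated in the text.

```python
import numpy as np


def backproject_pixel(u, v, depth, K, cam_to_world):
    """Map pixel (u, v) with a known depth to a 3D world coordinate."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Point in the camera frame (pinhole model, depth measured along the z-axis).
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    # Assumed convention: the extrinsics are a 4x4 camera-to-world transform.
    return (cam_to_world @ p_cam)[:3]
```

If the extrinsics are instead stored as a world-to-camera matrix, the matrix would need to be inverted before use.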
3.1.3 Barycentric Interpolation
Since each pixel is associated with the three vertices of the mesh triangle that it lies in, we propagate the image features to the three triangle vertices via barycentric interpolation. For each vertex, we sum the 256-dimensional features from all pixels and normalize the result to a unit-length vector. Vertices that are occluded by 3D geometry and do not have any mapped 2D image features are assigned zero vectors.
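The propagation of pixel features to mesh vertices can be sketched as follows. This is a minimal illustration under stated assumptions: each pixel's feature is weighted by its barycentric weight before being summed at a vertex (the text does not spell out the exact weighting), and the helper names `barycentric_weights` and `splat_features` are hypothetical.

```python
import numpy as np


def barycentric_weights(p, a, b, c):
    """Barycentric weights of point p with respect to triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    w_b = (d11 * d20 - d01 * d21) / denom
    w_c = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - w_b - w_c, w_b, w_c])


def splat_features(num_vertices, pixel_feats, pixel_tris, pixel_weights):
    """Accumulate weighted pixel features on vertices, then L2-normalize."""
    acc = np.zeros((num_vertices, pixel_feats.shape[1]))
    for feat, tri, weights in zip(pixel_feats, pixel_tris, pixel_weights):
        for vertex_id, w in zip(tri, weights):
            acc[vertex_id] += w * feat
    norms = np.linalg.norm(acc, axis=1, keepdims=True)
    # Vertices that received no features keep a zero vector.
    return np.divide(acc, norms, out=np.zeros_like(acc), where=norms > 0)
```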
3.2 Geometry Structure
3D scenes have varied layouts and contain multiple objects with different poses and locations. To encode a complex scene, we propose two point-based architectures that learn the structure details in a sub-volume and the global context priors respectively. The decoder predicts semantic labels for the points in the sub-volume from aggregated features.
3.2.1 Sub-volume Encoder
For each sub-volume, we extract a dense point set in order to preserve the structure details. In our experiments, we sample 8192 points for each sub-volume. Besides the Euclidean coordinates, we concatenate the 256-dimensional features from DeepLab [chen2018deeplab] and the vertex normal, resulting in a 262-dimensional feature for each point. We apply a series of sampling, grouping and multi-layer perceptron (MLP) operations as in [qi2017pointnet++] to reduce the number of points and extract features that represent each point in the sub-volume. We use four layers to encode the input point set, and the numbers of output points of the layers are 1024, 256, 64 and 16, respectively. As our input feature dimension is much larger than that of the original PointNet++ [qi2017pointnet++] (262 vs. 3), we increase the dimensions of each MLP layer. The parameters for each layer are [1024, 0.1, 32, (128, 128)], [256, 0.2, 32, (256, 256)], [64, 0.4, 32, (512, 512)], [16, 0.8, 32, (512, 512, 726)], where the parameters are in the format [number of sample points, grouping radius (in meters), number of points in a group, (MLP output dimensions)]. Note that each layer consists of multiple MLPs.
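To make the 262-dimensional input concrete, the sketch below assembles per-point features and draws a fixed-size sample. The concatenation order (coordinates, normal, image features), the uniform random sampling, and the helper names are assumptions for illustration only.

```python
import numpy as np


def build_point_features(xyz, normals, image_feats):
    """Concatenate 3 coordinates, 3 normal components and 256 image features."""
    return np.concatenate([xyz, normals, image_feats], axis=1)


def sample_points(num_available, num_samples, seed=0):
    """Pick a fixed number of point indices; repeat points only if too few exist."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_available, size=num_samples,
                      replace=num_available < num_samples)
```

For a sub-volume with `n` vertices, `sample_points(n, 8192)` would yield the indices of the 8192 points fed to the encoder.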
3.2.2 Global Scene Encoder
We introduce a global scene encoder to learn the global priors from a relatively sparse point set that covers the whole scene. In our experiments, we sample 16384 points, which is approximately 10% of the mesh vertices of the entire scene. Each point is associated with 262-dimensional features as in the sub-volume encoder. The sparse scene points with 2D image features are fed into the global context encoder to obtain the global context features. Similar to the sub-volume encoder, four layers are used. The numbers of sample points of the layers are 4096, 1024, 256 and 128, respectively. The parameters of the layers are [4096, 0.4, 32, (128, 128)], [1024, 0.8, 32, (256, 256)], [256, 1.2, 32, (512, 512)], [128, 1.6, 32, (512, 512, 726)].
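The layer parameters of the two encoders can be written down as plain configuration data, which makes the downsampling schedule easy to check. The class and variable names below are hypothetical; the numeric values are copied from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderLayer:
    num_points: int    # number of sample points kept by the layer
    radius_m: float    # grouping radius in meters
    group_size: int    # number of points in a group
    mlp_dims: tuple    # MLP output dimensions


SUB_VOLUME_ENCODER = [
    EncoderLayer(1024, 0.1, 32, (128, 128)),
    EncoderLayer(256, 0.2, 32, (256, 256)),
    EncoderLayer(64, 0.4, 32, (512, 512)),
    EncoderLayer(16, 0.8, 32, (512, 512, 726)),
]

GLOBAL_SCENE_ENCODER = [
    EncoderLayer(4096, 0.4, 32, (128, 128)),
    EncoderLayer(1024, 0.8, 32, (256, 256)),
    EncoderLayer(256, 1.2, 32, (512, 512)),
    EncoderLayer(128, 1.6, 32, (512, 512, 726)),
]


def downsampling_schedule(num_input_points, layers):
    """Number of points before the encoder and after each layer."""
    return [num_input_points] + [layer.num_points for layer in layers]
```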
The decoder network, consisting of four layers, learns from the textural, local geometric and global prior features produced by the encoders. Each layer in the decoder consists of three parts: feature concatenation, multi-layer perceptron (MLP) and feature propagation.
We concatenate three different features for each point in the target sub-volume. (1) The interpolated features from the previous layer (blue color in Figure 2); (2) the interpolated features from the global scene encoder (orange color in Figure 2); (3) the features from the skip layers in the sub-volume encoder (gray color in Figure 2).
MLP and feature propagation
Similar to the encoder, each layer consists of multiple MLPs. In our experiment, we use two MLP layers with 256 dimensions in each layer. We apply feature propagation layers as in [qi2017pointnet++] to upsample the points from the previous layer.
Four layers are deployed to decode the aggregated features and upsample the points from the bottleneck size back to the original input size. Finally, the point features pass through an output layer consisting of two MLPs to produce the class probabilities for each point in the sub-volume.
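For reference, feature propagation in PointNet++ interpolates features from a sparse point set to a denser one using inverse-distance weights over the nearest neighbors. The sketch below follows that scheme with three neighbors and squared distances; the brute-force neighbor search and the function name are simplifications made for this example, and the learned MLP that follows the interpolation is omitted.

```python
import numpy as np


def propagate_features(dense_xyz, sparse_xyz, sparse_feats, k=3, eps=1e-8):
    """Interpolate sparse point features onto dense points (inverse-distance weights)."""
    # Squared distances between every dense and every sparse point: shape (N, M).
    d2 = ((dense_xyz[:, None, :] - sparse_xyz[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]            # k nearest sparse points
    nearest_d2 = np.take_along_axis(d2, idx, axis=1)
    weights = 1.0 / (nearest_d2 + eps)
    weights /= weights.sum(axis=1, keepdims=True)  # weights sum to one per point
    return (sparse_feats[idx] * weights[..., None]).sum(axis=1)
```

In the decoder, the interpolated features would then be concatenated with the global scene features and the skip features before the MLPs are applied.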
3.3 Overlapping Sub-volumes
Since our network produces the points’ class probabilities within a sub-volume, we use a sliding window strategy to obtain the final predictions for all points in a whole scene. As shown in Table 4, the performance is significantly improved by overlapping sliding windows. We sum the class probabilities of each point in the overlapping regions and select the class with the maximum score. In our experiments, we choose the window size and stride size so as to produce good results in a reasonable time; see Table 4 for more details.
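The voting step over overlapping windows can be sketched as follows: the class probabilities predicted for a point in each window are summed, and the class with the largest total is selected. The data layout (a list of index/probability pairs per window) and the function name are assumptions for the example.

```python
import numpy as np


def fuse_window_predictions(num_points, num_classes, windows):
    """Sum per-window class probabilities per point and take the argmax.

    `windows` is a list of (point_ids, probs) pairs, where point_ids has
    shape (n,) and probs has shape (n, num_classes).
    """
    scores = np.zeros((num_points, num_classes))
    for point_ids, probs in windows:
        # np.add.at accumulates correctly even if an index repeats.
        np.add.at(scores, point_ids, probs)
    return scores.argmax(axis=1)
```

Points that are covered by no window keep an all-zero score and default to class 0 in this sketch, so a real pipeline would need to make sure every point is covered.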
|3D Sparse Conv [graham20183d]||93.9||77.4||82.4||73.0||64.2||45.7||54.3||76.4||54.8||75.2||50.7||50.1||22.8||71.5||53.9||50.7||55.6||37.0||76.4||44.7||60.5|
|Ours w/ SfM poses||95.4||82.4||86.9||73.0||71.2||58.4||57.1||80.5||60.7||90.8||60.7||62.8||35.6||77.3||68.5||63.1||59.7||50.3||83.7||55.8||68.2|
|Ours w/ syn poses||95.5||84.0||87.9||72.5||71.1||60.1||65.5||80.9||61.0||87.0||60.4||63.2||39.2||78.0||72.4||64.4||60.1||46.1||77.5||57.4||69.2|
|3D Sparse Conv [graham20183d]||93.7||79.9||76.2||66.3||49.6||45.2||62.0||75.0||46.9||80.8||54.0||58.6||27.5||68.8||56.1||53.7||43.3||54.0||59.8||44.1||59.8|
|Ours w/ SfM poses||94.3||79.5||79.5||74.4||57||53.9||57.1||74.6||48.5||85.9||63.5||62.8||28.7||61.2||79.8||41.8||38.6||52||64.5||44.5||62.1|
|Ours w/ syn poses||95.1||81.4||82.5||76.4||55.9||56.1||63.3||77.8||46.7||83.8||57.9||59.8||29.1||66.7||80.4||45.8||42.0||56.6||61.4||49.4||63.4|
ScanNet Benchmark [ScanNetBenchmark]
This dataset was proposed in 2017 by Dai et al. [dai2017scannet] for 2D and 3D indoor scene semantic segmentation, and is currently the largest and most challenging RGB-D reconstruction dataset. It contains 1513 RGB-D indoor scans and provides both 3D vertex labels and dense 2D pixel labels, as well as the corresponding camera parameters. We use the train/validation split provided by ScanNet, with 1201 scans for training and 312 for validation. We perform the 20-class semantic segmentation task defined in the benchmark.
We follow the 3D evaluation metrics of the ScanNet benchmark [ScanNetBenchmark], which computes the mean Intersection over Union (mIoU) score between the predicted labels and the ground truth labels. We use the evaluation script provided by the ScanNet benchmark to obtain our validation scores, and upload our results to the online evaluation system to obtain our testing scores.
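For readers unfamiliar with the metric, the following minimal sketch computes mIoU from per-point labels. It is not the official ScanNet evaluation script; the `ignore_label` value for unannotated points and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions of this example.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, ignore_label=-1):
    """Mean Intersection over Union across classes for per-point labels."""
    valid = gt != ignore_label          # drop unannotated points
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        intersection = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                   # skip classes absent from both
            ious.append(intersection / union)
    return float(np.mean(ious))
```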
For each scene, we randomly sample 8192 points for a fixed-size sub-volume and 16384 points for the entire scene as our training samples. We check the label distribution of each sub-volume and discard sub-volumes with too few annotated vertices. We randomly rotate the entire scene points around the z-axis for data augmentation. We use a fixed batch size per Nvidia P100 GPU and deploy our model on two GPUs during training. The optimizer is Adam, and the learning rate decays step-wise from its initial value. A weighted cross-entropy loss is adopted to deal with unbalanced ground truth classes.
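Two of the training ingredients mentioned above, rotation around the z-axis and a class-weighted cross-entropy, can be sketched in a few lines. The uniform angle range, the normalization of the loss by the sum of the weights, and the function names are assumptions for illustration; the text does not specify them.

```python
import numpy as np


def rotate_z(points, angle):
    """Rotate an (N, 3) point array around the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rotation.T


def random_rotate_z(points, rng):
    """Augmentation: rotate by an angle drawn uniformly from [0, 2*pi)."""
    return rotate_z(points, rng.uniform(0.0, 2.0 * np.pi))


def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted negative log-likelihood, normalized by the total weight."""
    weights = class_weights[labels]
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return float((weights * nll).sum() / weights.sum())
```

Surface normals, if used as point features, would need to be rotated by the same angle as the coordinates.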
At test time, we predict each point’s class label within the sub-volume region. We slide the sub-volume with overlapping prediction regions to produce a semantic map of the entire scene and to improve the performance. We pad the scene along the X and Y axes and slide the sub-volume through the entire scene, using the same stride size in both the X and Y directions.
We compare our framework with several state-of-the-art methods, including DeepLab [chen2018deeplab], ScanNet [dai2017scannet], PointNet++ [qi2017pointnet++], 3DMV [dai20183dmv], SparseConv [graham20183d] and SplatNet [su2018splatnet], on the ScanNet validation and testing sets. For the validation set (cf. Table 1), we use the original source code of DeepLab [chen2018deeplab], PointNet++ [qi2017pointnet++] and 3D Sparse Conv [graham20183d] and fine-tune their models on the ScanNet training set. For DeepLab [chen2018deeplab], we predict one frame sampled from every 20 frames, using the same sampling rate as our method. We back-project the class probability of each pixel into 3D coordinates and compute the class probability of each vertex via barycentric interpolation. We aggregate the class probabilities from all sampled frames to obtain the final prediction. For PointNet++ [qi2017pointnet++], we also use overlapping sliding windows with the same window size as in our settings for a fair comparison. We use 3D Sparse Conv [graham20183d] with color voxels and a 3D UNet [cciccek20163d] architecture as one of our baselines on the validation set. Note that we use color voxels in the 3D Sparse Conv [graham20183d] experiment in order to compare fairly with 3DMV [dai20183dmv], though 3D Sparse Conv reaches higher performance when using 3D ResNet [he2016deep] as the backbone network. For the testing set (cf. Table 2), we compare our method with the reported scores of ScanNet [dai2017scannet], PointNet++ [qi2017pointnet++], 3DMV [dai20183dmv] and SplatNet [su2018splatnet] from the ScanNet benchmark [ScanNetBenchmark] leaderboard.
|3D coordinates + vertex normal + global context + 2D image features||95.4||82.4||86.9||73.0||71.2||58.4||57.1||80.5||60.7||90.8||60.7||62.8||35.6||77.3||68.5||63.1||59.7||50.3||83.7||55.8||68.2|
5 3D Point Cloud Segmentation Results
Table 1 and Table 2 summarize the average IoU of different methods on the ScanNet validation and testing sets, respectively. Our unified model outperforms the existing fusion-based methods, 3DMV [dai20183dmv] and SplatNet [su2018splatnet], on the testing set. We attribute this to the following reasons: (1) Our framework uniformly samples points on the object surfaces, while voxel-based approaches divide the input space into voxels, which introduces quantization errors; (2) We simultaneously optimize the 2D textural, 3D geometric and global context features within a point-based framework for better predictions; (3) We preserve all the structure information without projecting the high-dimensional points onto a hyperplane as in [su2018splatnet].
6 Ablation Studies
We provide an in-depth analysis of features, stride size of the sliding window, and synthetic camera pose in the following paragraphs.
6.1 Feature Analysis
We evaluate different types of features in our unified framework: 3D coordinates, vertex normals, global context and 2D image features, as shown in Table 3. Note that the first row (3D coordinates only) in Table 3 is equivalent to PointNet++ [qi2017pointnet++] in Table 1. We also add normal vectors to each point in the PointNet++ [qi2017pointnet++] setting, which corresponds to the third row in Table 3.
Global context priors improve overall performance in all cases, i.e., for every feature combination in Table 3 to which they are added. Note that global priors are not limited to single-room cases; they generalize to the 21 scene types in the ScanNet benchmark. An interesting example is that curtain and shower curtain are very difficult to distinguish without knowing the scene context. By incorporating the global prior, the accuracy of both classes increases significantly (curtain from 66.4% to 69.6% and shower curtain from 58.4% to 64.3%).
Vertex normals improve our framework when added to the existing feature combination (Table 3), suggesting that normal vectors effectively describe the 3D scene and object structures. For example, bathtub, chair and table have very different geometric structures.
2D image features also yield a significant improvement in overall performance, as the 2D CNN is pre-trained on a large-scale image dataset and fine-tuned on high-resolution 2D images, which helps discriminate fine-grained details of objects without explicit structures, such as a painting on a wall. As a result, our method achieves better performance than the color voxels used in 3D Sparse Conv (cf. Table 1 and Table 2).
In conclusion, the experiments in Table 3 demonstrate that normal vectors, color features and global priors carry different semantics. The performance improves as each additional type of information is fused into our framework. This indicates that texture, geometry and global context each encode different information for semantics.
|Stride size||1.5 m||1.05 m||0.75 m||0.45 m|
|mIoU||59.5%||60.8%||61.8%||62.2%|
6.2 Sub-volume Stride
In Table 4, we evaluate different stride sizes ranging from 1.5 m to 0.45 m, where a stride size of 1.5 m means no overlap, as it equals our sub-volume size. A stride size of 1.05 m in both depth and width yields a 1.3% improvement in mIoU. The performance is further boosted to 62.2% (a 2.7% gain) with a stride size of 0.45 m. We see only marginal improvement for smaller stride sizes, while the running time increases rapidly, since the number of windows grows roughly quadratically as the stride shrinks in both directions. Therefore, we choose a stride size that shows good performance while running in a reasonable time.
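The cost side of this trade-off can be illustrated by counting the sliding windows needed to cover a scene for each stride in Table 4. The 6 m by 6 m scene extent used in the usage note is a hypothetical example, the 1.5 m window follows from the no-overlap stride, and the rounding guard only protects against floating-point noise in the division.

```python
import math


def num_windows(extent_x, extent_y, window, stride):
    """Number of sliding windows needed to cover an extent_x by extent_y area."""
    def per_axis(extent):
        # Round first so that e.g. 4.5 / 0.45 is treated as exactly 10 steps.
        steps = math.ceil(round((extent - window) / stride, 9))
        return max(steps, 0) + 1
    return per_axis(extent_x) * per_axis(extent_y)
```

For a 6 m by 6 m scene and a 1.5 m window, strides of 1.5 m, 1.05 m, 0.75 m and 0.45 m require 16, 36, 49 and 121 windows, respectively, so the smallest stride costs roughly 7.5 times as many forward passes as the non-overlapping setting.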
|Number of images||Vertex coverage||mIoU|
6.3 Synthetic Camera Model
We explore a synthetic camera model to mitigate the wrongly estimated camera poses from SfM, which affect the 3D segmentation performance on the ScanNet benchmark. We slice each scene into three height levels. At each level, we divide the scene width and scene depth into equal partitions. As shown in Figure 4 (a) and (b), for each camera position, we render images at three attitude angles and three azimuth angles. For each scene, we discard the captured images with insufficient context and select the image set that has the highest coverage of the scene vertices; this is done for both the validation set and the testing set. Table 5 shows the experiments using different numbers of rendered images from synthetic camera poses, which improve on the segmentation results obtained with structure from motion poses (i.e., 68.2%).
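A possible way to generate candidate camera positions and to pick a high-coverage image set is sketched below. Placing cameras at the centers of the grid cells and selecting images greedily by newly covered vertices are assumptions of this example; the text does not state how the positions are placed within a partition or how the best image set is found, and all names here are hypothetical.

```python
def camera_positions(x_range, y_range, heights, partitions):
    """Grid of candidate camera positions: one per cell and height level."""
    def cell_centers(lo, hi):
        step = (hi - lo) / partitions
        return [lo + (i + 0.5) * step for i in range(partitions)]
    xs, ys = cell_centers(*x_range), cell_centers(*y_range)
    return [(x, y, z) for z in heights for x in xs for y in ys]


def greedy_select(visible_sets, budget):
    """Greedily choose up to `budget` images that add the most unseen vertices."""
    covered, chosen = set(), []
    for _ in range(budget):
        candidates = [i for i in range(len(visible_sets)) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: len(visible_sets[i] - covered))
        if not visible_sets[best] - covered:
            break  # no remaining image adds new vertices
        chosen.append(best)
        covered |= visible_sets[best]
    return chosen, covered
```

Here `visible_sets[i]` is the set of vertex ids visible in image `i`, which could be obtained from the pixel-vertex association of Section 3.1.2.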
We have presented a unified point-based framework that jointly optimizes 2D image features, 3D structures and global context priors. By leveraging global context priors, we improved 3D semantic segmentation performance over several state-of-the-art methods on the ScanNet benchmark [ScanNetBenchmark], confirming the ability of our model to deliver more informative features than previous work. Our in-depth feature analysis shows that texture, geometry and global context encode different semantic information. We also showed that overlapping sub-volumes and synthetic camera poses further improve the prediction results.
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 108-2634-F-002-004 and FIH Mobile Limited. We also benefit from the NVIDIA grants and the DGX-1 AI Supercomputer.