Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes

04/26/2020 ∙ by Haiyan Wang, et al.

The deficiency of 3D segmentation labels is one of the main obstacles to effective point cloud segmentation, especially for wild scenes with varieties of different objects. To alleviate this issue, we propose a novel deep graph convolutional network-based framework for large-scale semantic scene segmentation in point clouds with sole 2D supervision. Unlike numerous preceding multi-view supervised approaches that focus on single-object point clouds, we argue that 2D supervision is capable of providing sufficient guidance for training 3D semantic segmentation models of natural scene point clouds, even with only a single view per training sample and without explicitly capturing their inherent structures. Specifically, a Graph-based Pyramid Feature Network (GPFN) is designed to implicitly infer both global and local features of point sets, and an Observability Network (OBSNet) is introduced to further solve the object occlusion problem caused by the complicated spatial relations of objects in 3D scenes. During the projection process, perspective rendering and semantic fusion modules are proposed to provide refined 2D supervision signals for training, along with a 2D-3D joint optimization strategy. Extensive experimental results demonstrate the effectiveness of our 2D supervised framework, which achieves comparable results with state-of-the-art approaches trained with full 3D labels for semantic point cloud segmentation on the popular SUNCG synthetic dataset and the S3DIS real-world dataset.




I Introduction

The last decade has witnessed advances in 3D data capturing technologies, which have become increasingly ubiquitous and paved the way for generating highly accurate point cloud data. Such devices include laser scanners, time-of-flight sensors (e.g., Microsoft Kinect and Intel RealSense), structured light sensors (e.g., iPhone X and Structure Sensor), and outdoor LiDAR sensors. 3D information can significantly contribute to fine-grained scene understanding. For instance, depth information can drastically reduce segmentation ambiguities in 2D images, and surface normals in 3D data provide important cues of scene geometry. However, 3D data are typically formed as point clouds (geometric point sets in Euclidean space), represented as sets of unordered 3D points with or without additional information such as corresponding RGB values. The 3D points do not conform to the regular lattice grids of 2D images, and directly converting point clouds to regular volumetric grids can be computationally intractable due to unnecessary sparsity and high-resolution volumes. PointNet [27] and PointNet++ [28] pioneered the use of deep learning for 3D point cloud processing while handling the permutation invariance problem, covering tasks such as reconstruction and semantic segmentation. However, these methods still heavily depend on aligned point-wise 3D labels as strong supervision signals for training, which are difficult and cumbersome to prepare and annotate.

Fig. 1: Illustration of the proposed weakly 2D supervised semantic segmentation of 3D point cloud in the wild scenes. Without using point-wise 3D annotations, we leverage 2D segmentation maps of different viewpoints to supervise the 3D training process.
Term Definition
Supervised Learning
learns mapping function between input and output pairs using fully labeled training examples.
Weakly Supervised Learning
learns mapping function between input and output pairs using coarse or imprecise labels instead of
fully labeled training examples.
Truncated Point Cloud
refers to the points inside a frustum under a specific viewpoint in a 3D space. In our paper, it is obtained
by casting rays from the camera to the scene and extracting the points in a view (see Figure 2) and used as
the input data to our framework.
3D Label & 2D Label
3D label indicates the category label of each point for point cloud segmentation. 2D label refers to the
category label of each pixel in 2D segmentation maps.
TABLE I: Definitions of the key terms used in the paper.

Unlike existing methods, which typically require expensive point-wise 3D annotations, as shown in Figure 1, this paper tackles the task of semantic point cloud segmentation for natural scenes by only utilizing popular 2D supervision signals such as 2D segmentation maps to supervise the 3D training process. We argue that 2D supervision is capable of providing sufficient guidance to train 3D semantic scene segmentation models from point clouds without explicitly capturing the inherent structures of 3D point clouds. By rendering 2D pixels from the point cloud, supervised by 2D segmentation maps, our proposed framework is able to learn semantic information for each point. Compared to 3D data, 2D data are often much easier to obtain, thus saving the huge effort of collecting a ground-truth label for each point as in 3D supervision. Unlike some recent 2D multi-view supervision-based single-object 3D reconstruction approaches [21, 20, 17] (enforcing cycle-consistency or not), which solely focus on single objects and require 2D data from multiple viewpoints, our approach works on natural scene segmentation of point clouds with multiple objects and only a single view per truncated point cloud.

Occluded objects may not be correctly labeled when generating 2D segmentation maps from a given viewpoint. Due to the sparseness of point clouds and the unknown spatial relations and topology of surfaces in a scene, it is challenging to determine whether 3D points belong to occluded or visible objects by just using depth distances under specific camera viewpoints. As a result, if a 3D point cloud is directly projected onto 2D image planes, occluded points might also appear on the images, which misguides the entire scene segmentation. Therefore, identifying the spatial geometric relations of objects and removing such points from the projected 2D images are crucial to the design of the joint optimization strategy. In order to tackle the occlusion issue, we introduce an OBSNet (Observability Network) to provide guidance for accurate projection of segmentation maps by removing the occluded points. Given a point cloud that contains RGB and depth information as input, the OBSNet directly outputs the visibility mask for each point. Furthermore, multiple points might collide if they are projected to the same location in 2D images. Instead of simply using the depth attribute of points as a filtering mechanism, we propose a novel reprojection regime named perspective rendering to perform semantic fusion of different points, which significantly alleviates the point collision problem.

The unified architecture illustrated in Figure 3 comprises a Graph-based Pyramid Feature Network (GPFN), a 2D perspective rendering module, and a 2D-3D joint optimizer. Specifically, the graph convolutional feature pyramid encoder works to hierarchically infer the semantic information of a scene in both local and global levels. The 2D perspective rendering works along with the predicted segmentation maps and the visibility masks to generate effective refined 2D maps for loss computation. The 2D-3D joint optimizer supports a complete end-to-end training. To make this paper easy to understand, we define the key terms in Table I.

Fig. 2: Illustration of a truncated point cloud. The gray dashed lines refer to the rays cast from the camera, and the area contained within the red dashed lines is the truncated point cloud under a viewpoint. Note that there is one 2D RGB image corresponding to the truncated point cloud under the same viewpoint.

In an extension to our preliminary work [38], instead of using the distance filter to solve the object occlusion problem, we introduce an OBSNet to our framework, which learns to predict the visibility mask in an end-to-end manner. In addition, we explore transfer learning from synthetic data to real-world data for the 3D point cloud segmentation task. The main contributions are summarized as follows:

  • A joint 2D-3D deep architecture is designed to compute hierarchical and spatially-aware features of point clouds by integrating graph-based convolution and pyramid structure for encoding, which further compensates weak 2D supervision information.

  • A novel re-projection method, named perspective rendering, is proposed to enforce 2D and 3D mapping correspondence. Our approach significantly alleviates the needs for 3D point-wise annotations in training, while only 2D segmentation maps are used to calculate loss with the re-projection.

  • An observability network is introduced to predict whether a point is visible or occluded and to generate a visibility mask without using any additional geometry information. Combined with the segmentation map and the perspective rendering, we can further take advantage of the 2D information to supervise the whole training process.

  • To the best of our knowledge, this is the first work to apply 2D supervision for 3D semantic point cloud segmentation of wild scenes without using any 3D point-wise annotations. Extensive experiments are conducted and the proposed method achieves comparable performance with the state-of-the-art 3D supervised methods on the popular SUNCG [33] and S3DIS [2] benchmarks.

The rest of this article is organized as follows: Section 2 introduces related work on deep learning for 3D point cloud processing, 3D semantic segmentation, and 2D supervised methods for 3D tasks. Section 3 describes the details of our framework for graph-based weakly supervised point cloud semantic segmentation. Section 4 presents the datasets and experiments used to evaluate the proposed weakly supervised segmentation model. Finally, Section 5 summarizes the proposed work and points out future directions.

Fig. 3: The pipeline of the proposed deep graph convolutional framework for 2D-supervised 3D semantic point cloud segmentation. The GPFN contains one encoder network and two decoder networks that share the same encoder. The first is the segmentation decoder, which predicts the segmented point cloud; the other is the OBSNet decoder, which outputs the visibility of each point. Finally, perspective rendering produces the projected 2D mask, which further jointly optimizes the whole network.

II Related Work

II-A Deep Learning for 3D Point Cloud Processing.

In the deep learning era, early attempts at processing large 3D point cloud data usually replicated successful convolutional architectures by converting point sets to regular grid-like voxels [4, 7, 23, 6, 18], which extended 2D CNNs to 3D CNNs and integrated the volumetric occupancy representation. The main problem of voxel-based methods is the huge number of network parameters, which grows with the spatial resolution. Other methods based on the k-d tree [3] and the Octree [31, 12] were proposed to deal with point cloud data by hierarchically partitioning and indexing the 3D Euclidean space. However, building a k-d tree is computationally expensive, and it is harder to adapt to dynamic scenes than an Octree. The Octree, while much more efficient, can only approximate rather than fully represent an object or scene.
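The voxel conversion these early methods rely on can be sketched in a few lines. The following NumPy snippet (the function name `voxelize` and its interface are ours, not from any cited work) builds a binary occupancy grid from an unordered point set:

```python
import numpy as np

def voxelize(points, voxel_size, grid_dim):
    """Convert an unordered point set into a binary occupancy grid.

    points:     (N, 3) XYZ coordinates, assumed non-negative
    voxel_size: edge length of a single voxel
    grid_dim:   number of voxels along each axis
    """
    idx = np.floor(points / voxel_size).astype(int)       # point -> voxel index
    keep = np.all((idx >= 0) & (idx < grid_dim), axis=1)  # drop out-of-grid points
    grid = np.zeros((grid_dim, grid_dim, grid_dim), dtype=np.uint8)
    grid[idx[keep, 0], idx[keep, 1], idx[keep, 2]] = 1    # mark occupied voxels
    return grid
```

Note that the grid's memory footprint grows cubically with `grid_dim`, which is exactly the resolution problem discussed above.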

End-to-end deep auto-encoder networks have also been employed to directly handle point clouds. Achlioptas et al. conducted unsupervised point cloud learning using a PointNet-like [27] encoder structure and three simple fully-connected layers as the decoder network [1]. Although the design is simple and straightforward, the generative model could already reconstruct point clouds of unseen objects. FoldingNet [41] improved the auto-encoder design by integrating a graph-based encoder and a folding-based decoder network, which is more powerful and interpretable for reconstructing dense and complete single objects.

Recently, more approaches have emerged that directly feed point clouds to networks while fulfilling permutation invariance, including PointNet [27], PointNet++ [28], and Frustum PointNets [29]. Meanwhile, graph convolution methods have demonstrated their effectiveness on point cloud problems. RGCNN and DGCNN [40, 36] were proposed to first construct a graph of points and then utilize graph convolution to extract features. Due to the topology and geometry information embedded in the graph structure, these networks demonstrated a higher capability to process point cloud data and achieved considerable success on 3D point cloud-based tasks such as classification [34, 32, 30, 13], detection [9], segmentation [42], reconstruction [22], and completion [43, 15]. This paper focuses on the task of 3D point cloud semantic segmentation for natural scenes.

II-B 3D Semantic Segmentation.

Before PointNet was proposed, early deep learning-based methods had already become popular in solving 3D semantic segmentation using voxel-based representations [35, 24]. Voxelization makes the raw point cloud ordered and structured, so it can be further processed by standard 3D convolutions. SegCloud [35] is an end-to-end 3D point cloud segmentation framework that first predicts coarse voxels using a 3D CNN with trilinear interpolation (TI); fully connected Conditional Random Fields (FC-CRF) are then employed to refine the semantic information on the points and accomplish the 3D point cloud segmentation task. Other methods such as [33] and [8] tackled semantic scene completion from the 3D volume perspective and explored the relationship between scene completion and semantic scene parsing. Song et al. were the first to perform semantic scene completion using a single depth image as input [33]. They focused on context learning using a dilation-based 3D context module and thus predicted the occupancy and semantic label of each voxel grid well. However, the limitation of volume-based 3D methods is that they sacrifice representation accuracy and struggle to preserve high-frequency textures under limited spatial resources.

Recently, some methods were proposed to handle 3D semantic segmentation from the perspective of points, taking permutation-invariant point cloud data as input and outputting the class label for each point [27, 28]. SPG was proposed as a graph-based method to handle large-scale point clouds as super-points [19]. It partitions a 3D scan scene into super-points, which are parts with simple shapes according to their geometric constraints. In conjunction with the encoded contextual relationship between points, it further increases the prediction accuracy of the semantic labels. The frameworks proposed in [10, 11] aimed to enlarge the receptive field over the 3D scene and explored both input-level and output-level context information for semantic segmentation; a multi-scale architecture was also applied to boost performance. Wang et al. proposed a method to find the mutual promotion between instance segmentation and semantic segmentation [39], showing that the two tasks can be linked and improve each other. Unlike these existing methods, our approach focuses on effectively utilizing easily accessible 2D training data for 3D large-scale scenes.

II-C 2D Supervision for 3D Tasks.

While 3D supervised semantic segmentation has made great progress, many researchers have started to explore using 2D labels to train networks for 3D tasks, reducing the heavy workload of creating 3D annotations (point clouds, voxels, meshes, etc.), although most such methods are designed for single objects. The work proposed in [21] attempted to generate point clouds for object reconstruction and applied a 2D projection mask and depth mask for joint optimization. The authors introduced pseudo-rendering in the 2D image plane, which resolves collisions within a single object during projection. However, the simple up-sampling followed by a max-pooling strategy only works well for a single object. When dealing with a more complex scene that contains multiple objects, pseudo-rendering cannot guarantee correct labels for different objects when they collide.

Navaneet et al. [25] proposed CAPNet for 3D point cloud reconstruction. The authors introduced a continuous approximation projection module and proposed a differentiable point cloud rendering to generate smooth and accurate point cloud projections. Through the supervision of 2D projections, their method achieved better reconstruction results than pseudo-rendering [21] and showed generalizability to real data.

Chen et al. [5] proposed a network to predict depth images from point cloud data in a coarse-to-fine manner. On the one hand, they directly predicted the depth image through an encoder-decoder network. On the other hand, they reprojected the depth image to the 3D point cloud and calculated the 3D flow for each point. Combining 3D geometric prior knowledge with 2D texture information, the network could iteratively refine the depth image against the ground truth and aggregate multi-view image features.

Pittaluga et al. [26] tackled a privacy attack task by reconstructing the RGB image from a sparse point cloud. The model, containing three cascaded U-Nets, took the point cloud as input and output a refined RGB image. Combining RGB, depth, and SIFT descriptors, the first U-Net estimated the visibility of each point. The following two U-Nets, CoarseNet and RefineNet, then generated coarse-to-fine RGB images. Novel views can also be generated by taking virtual tours of the whole scene.

Following the track of our preliminary work [38], papers have started to explore applying 2D supervision signals to the 3D scene point cloud segmentation task. Wang et al. [37] proposed a method that first conducts 2D RGB image segmentation using Mask R-CNN [14] and then diffuses the 2D semantic labels to 3D space. Through the geometric graph connections between points, they finally obtain the semantic labels for the LiDAR point cloud. However, this method relies heavily on 2D segmentation networks such as Mask R-CNN, and it does not take advantage of the global features of the point cloud.

This paper extends our previous work [38] and proposes an unprecedented method towards better 2D supervision for 3D point cloud semantic scene segmentation, demonstrating its effectiveness on the SUNCG synthetic dataset and the S3DIS real-world dataset.

Fig. 4: Illustration of the effectiveness of visibility prediction by our proposed OBSNet. (a) RGB image (just for visualization, and not used in training); (b) Truncated point cloud used as input to our network; (c) Visibility point cloud from the OBSNet; (d) 2D ground truth segmentation map under same viewpoint; (e) the projected mask without OBSNet, and (f) the projected mask after applying the OBSNet. Two areas of point cloud semantic segmentation results with collision are zoomed in to visualize better details: in the red box, points of different objects (chair and wall) are projected in the same region in the 2D image plane before adding the OBSNet. After applying the visibility calculation, they are correctly separated. The collision problem is also resolved as shown in the corresponding black boxes between the clutter (in black) and window (in purple).

III Methodology

III-A Overview

3D supervised deep models for semantic point cloud segmentation, such as PointNet [27], PointNet++ [28], and DGCNN [40], usually require 3D point-wise class labels in training and achieve satisfying results. To reduce the expensive labeling effort for each point in 3D point cloud data, we propose a weakly 2D-supervised method that only uses 2D ground-truth segmentation maps, which are considerably easier to obtain, to supervise the whole training process. Inspired by DGCNN [40], we propose an effective encoder-decoder network to learn the representation of the point cloud.

Figure 3 illustrates the proposed deep graph convolutional network-based framework for weakly supervised 3D semantic point cloud segmentation, which comprises two main components: the Graph-based Pyramid Feature Network (GPFN) and the 2D optimization module. The GPFN adopts a PointNet-like [27] structure as the baseline model, consisting of multiple MLP and max-pooling layers. On top of this baseline, the whole network contains a graph-based feature pyramid encoder and two decoder networks. A truncated point cloud is obtained by casting rays from the camera through each pixel into the scene and extracting the points under a specific viewpoint (see details in Section III-B). The encoder takes a truncated point cloud from a given viewpoint as input. Then, in order to solve the object occlusion problem, a novel framework with double-branch decoders is designed. A segmentation decoder predicts the semantic segmentation labels, while a visibility decoder estimates the visibility mask for the scene point cloud under a specific viewpoint. The segmentation map and the visibility mask are further combined to handle the point collision problem and to project a sparse 2D segmentation map. During 2D optimization, the sparse segmentation map is projected from the predicted segmentation point cloud by perspective rendering. The 2D ground-truth segmentation map is then applied to calculate the 2D sparse segmentation loss as the supervision signal in the training phase. To the best of our knowledge, this is the first work applying weakly 2D supervision to the point cloud semantic scene segmentation task.

III-B Graph Convolutional Feature Pyramid Network

By casting rays from the camera through each pixel into the scene, the points under specific viewpoints are extracted to obtain truncated point clouds for multiple viewpoints. An encoder-decoder network is trained that takes the truncated input point cloud of size N x D (N is the number of points and D is the dimension of each point, including XYZ and RGB) from a given viewpoint and predicts the class labels of the point cloud with size N x C (C is the number of classes).
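As a rough illustration of how a truncated point cloud can be extracted, the following NumPy sketch (our own simplification, assuming a standard pinhole camera with intrinsics K and extrinsics R, t; the function name is hypothetical) keeps only the points that project inside the image plane of a given viewpoint:

```python
import numpy as np

def truncate_point_cloud(points, R, t, K, img_size):
    """Boolean mask over the points that fall inside the camera frustum.

    points:   (N, 3) world-frame XYZ
    R, t:     camera rotation (3, 3) and translation (3,)
    K:        (3, 3) intrinsic matrix
    img_size: (height, width) of the image plane
    """
    cam = points @ R.T + t            # world -> camera coordinates
    in_front = cam[:, 2] > 0          # discard points behind the camera
    uvw = cam @ K.T                   # apply the intrinsics
    uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-9)  # perspective division
    h, w = img_size
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return in_front & inside

# Toy example: identity pose, unit focal length, 2x2 image
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 2.0],    # in front, projects inside the image
                [0.0, 0.0, -1.0],   # behind the camera
                [10.0, 0.0, 1.0]])  # projects outside the image
mask = truncate_point_cloud(pts, np.eye(3), np.zeros(3), K, (2, 2))
```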

First, the truncated point cloud from a given viewpoint is fed into the encoder network, which comprises a set of 1D convolution layers, edge graph convolution layers, and max-pooling layers that map the input data to a latent representation space. Then the segmentation decoder network processes the feature vector through several fully-connected layers and finally outputs the class prediction for each point.

In order to work with the weak 2D labels, a graph-based feature pyramid encoder is designed to mitigate the effect of weak labels on the point cloud segmentation. Benefiting from the dynamic graph convolution model and the pyramid structure design, the network can globally capture the semantic meaning of a scene in both low-level and high-level layers. Inspired by [40], we introduce K-NN dynamic graph edge convolution. For each graph convolution layer, the K-NN graph is recomputed, connecting each point x_i by edges to its k nearest points. Through the graph convolution, local neighborhood information is aggregated by capturing edge features between neighbors and center points:

x'_i = max_{j in N(i)} h(x_i, x_j - x_i),

where h is a shared edge-feature function and max denotes channel-wise max pooling over the neighborhood, following the edge convolution of [40].
As shown in Figure 3, two pyramid global layers are added to the GPFN. The global features are concatenated with the previous point features in both low level and high level. This pyramid design and augmented point feature matrix are effective to improve the performance when using 2D supervision.
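The dynamic edge convolution described above can be sketched as follows. This is a minimal NumPy illustration of the DGCNN-style operation [40], not the paper's actual layer (which uses learned MLPs and recomputes the K-NN graph in feature space at every layer); function names and the toy linear weight are ours:

```python
import numpy as np

def knn(points, k):
    """Indices of the k nearest neighbours of every point (excluding itself)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbour
    return np.argsort(d, axis=1)[:, :k]

def edge_conv(feats, idx, weight):
    """One edge-convolution step: edge feature (x_i, x_j - x_i),
    a shared linear map + ReLU, then max-pooling over the neighbourhood."""
    k = idx.shape[1]
    center = np.repeat(feats[:, None, :], k, axis=1)             # (N, k, C)
    neighbor = feats[idx]                                        # (N, k, C)
    edge = np.concatenate([center, neighbor - center], axis=-1)  # (N, k, 2C)
    out = np.maximum(edge @ weight, 0.0)                         # shared "MLP" + ReLU
    return out.max(axis=1)                                       # max over k neighbours
```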

III-C Visibility Estimation

During projection, a collision problem arises: points on occluded objects and on visible objects of various classes might be projected to the same location, causing intersections in the image plane. As shown in Figure 4, there exist collisions such as between bookcase and wall, computer and window, and chair and wall. In order to exploit the spatial relations of points, we need to determine which points should be considered visible under a specific viewpoint, so the removal of occluded points becomes crucial to our task; otherwise, it would be difficult to accurately utilize the 2D supervision for point cloud segmentation. In our previous work [38], we introduced a geometry-based distance filter, which requires additional effort, such as calculating the boundaries of segmentation maps, to solve the occlusion problem. In this paper, we propose an end-to-end network structure that contains the OBSNet decoder to solve the occlusion problem through data-driven training.

In order to simplify the search for the objects' spatial relationships and better solve the occlusion problem, we propose an end-to-end regression-based model to determine the visibility of the point cloud. As shown in Figure 3, the OBSNet decoder shares the same encoder network with the segmentation decoder; it takes the truncated point cloud as input and outputs a label of "visible" or "occluded" for each point. The OBSNet decoder also combines low-level and high-level features that aggregate geometric priors and spatial information, and it is trained in a supervised manner to predict a single label per point. The ground-truth visibility labels are obtained through the distance filter (see more details in [38]) during both training and testing. As a result, a point classified as occluded is eliminated and does not contribute to the loss calculation in the optimization process.
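The distance filter that produces these ground-truth visibility labels is, in essence, a per-pixel z-buffer. A minimal sketch under that assumption (our own simplification of the filter in [38]; the function name and tolerance are ours):

```python
import numpy as np

def visibility_labels(uv, depth, img_size, tol=0.1):
    """Label each projected point visible (1) or occluded (0).

    uv:    (N, 2) integer pixel coordinates of the projected points
    depth: (N,) camera-space depth of every point
    A point is visible when its depth is within `tol` of the smallest
    depth that lands on the same pixel (a per-pixel z-buffer).
    """
    h, w = img_size
    zbuf = np.full((h, w), np.inf)
    for (u, v), z in zip(uv, depth):          # pass 1: nearest depth per pixel
        zbuf[v, u] = min(zbuf[v, u], z)
    vis = np.array([z <= zbuf[v, u] + tol for (u, v), z in zip(uv, depth)])
    return vis.astype(int)

# Two points fall on pixel (0, 0); only the closer one is visible.
uv = np.array([[0, 0], [0, 0], [1, 0]])
depth = np.array([1.0, 3.0, 2.0])
vis = visibility_labels(uv, depth, (1, 2))
```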

During training, the two decoders mutually help and benefit from each other. The OBSNet decoder helps the network separate the spatial locations of objects based on their distance to the camera, which to some extent provides a rough segmentation of the 3D scene. In turn, the segmentation model learns enough semantic features and context information to guide the visibility prediction.

Fig. 5: Concept illustration of the proposed perspective rendering and semantic fusion. During projection, multiple points of different object classes (shown in different colors) are projected to grids (with corresponding colors) in the image plane; each grid indicates a pixel. The left side illustrates the point collision problem, in which multiple points may be projected to the same grid, and the right side shows our solution. Each point has a probability distribution over the predicted classes. For a grid with multiple projected points, perspective rendering computes the class-wise product of the probabilities of all the points; after normalization, the class label of the grid is finally determined.

III-D Perspective Rendering

For jointly optimizing the 2D and 3D networks and solving the point collision problem, we propose an innovative projection method named perspective rendering. A point p_w in the world coordinate system is transformed, via the camera pose and 3D transformation (R, t) of a given viewpoint, into the camera coordinate system through Eq. 2:

p_c = R p_w + t. (2)
However, as shown in Figure 5, different points might be projected to the same pixel position in the image plane. Perspective rendering therefore performs semantic fusion through Eq. 3: each point carries a predicted probability distribution over all classes, and the distributions of all points projected to the same pixel are fused by a class-wise product,

P(c) = (1/Z) * prod_i p_i(c), (3)

where p_i(c) is the probability of class c for the i-th point falling on the pixel and Z normalizes over classes. The class with the largest fused probability (e.g., yellow in Figure 5) is assigned as the final prediction label of the pixel.
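The fusion of colliding per-point class distributions can be sketched as follows. This is a minimal NumPy illustration (the function name and the -1 "empty pixel" convention are ours):

```python
import numpy as np

def semantic_fusion(uv, probs, img_size):
    """Fuse the class distributions of all points that hit the same pixel.

    uv:    (N, 2) integer pixel coordinates of the projected points
    probs: (N, C) per-point class probability distributions
    Colliding points are fused by a class-wise product of their
    distributions; after normalization, argmax gives the pixel label.
    Pixels that receive no point are labeled -1.
    """
    h, w = img_size
    fused = np.ones((h, w, probs.shape[1]))
    hit = np.zeros((h, w), dtype=bool)
    for (u, v), p in zip(uv, probs):
        fused[v, u] *= p              # class-wise product of colliding points
        hit[v, u] = True
    fused /= np.maximum(fused.sum(axis=-1, keepdims=True), 1e-12)  # normalize
    labels = fused.argmax(axis=-1)
    labels[~hit] = -1
    return labels
```

For example, two points with distributions (0.6, 0.4) and (0.7, 0.3) on the same pixel fuse to (0.42, 0.12), so the pixel takes class 0 after normalization.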


III-E 2D Optimization

The ground-truth segmentation map and the visibility mask are used to enforce consistency among the prediction results. The loss function contains the sparse point segmentation loss L_seg and the visibility mask loss L_obs. The sparse loss is calculated over the projected segmentation result during training as:

L_seg = -(1/|V|) * sum_{i in V} log p_i(y_i), (4)

where V is the set of valid projected points, p_i is the predicted label distribution of the i-th point projected to the 2D image plane, and y_i is obtained, according to the 2D coordinates of the projected point, by looking up the label of the corresponding pixel in the ground-truth segmentation map. L_obs uses a binary cross-entropy loss over the non-zero valid predicted points, similar to Eq. 4. The total loss is L = L_seg + lambda * L_obs, where lambda is a weighting factor.
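A minimal sketch of the sparse segmentation loss: cross-entropy evaluated only at the points that project into the 2D map, with the ground-truth label gathered at each projected coordinate (our simplification; the actual implementation operates on batched network outputs):

```python
import numpy as np

def sparse_seg_loss(pred_probs, uv, gt_map):
    """Cross-entropy over only the pixels that received a projected point.

    pred_probs: (N, C) predicted class distribution of each projected point
    uv:         (N, 2) integer pixel coordinates of the projections
    gt_map:     (H, W) ground-truth 2D segmentation map (class indices)
    """
    gt = gt_map[uv[:, 1], uv[:, 0]]                      # gather GT label per point
    p = pred_probs[np.arange(len(gt)), gt]               # prob of the true class
    return float(-np.log(np.clip(p, 1e-9, 1.0)).mean())  # mean negative log-likelihood
```

A perfect prediction on every projected point drives the loss to zero, which is the supervision signal the 2D segmentation map provides.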

IV Experiments

IV-A Datasets

The proposed weakly 2D-supervised 3D point cloud semantic segmentation method is evaluated on two public and challenging 3D wild scene datasets, including 1) SUNCG  [33], a synthetic 3D large-scale indoor scene dataset, and 2) S3DIS (Stanford Large-Scale 3D Indoor Spaces) dataset [2] derived from real environments.

SUNCG Synthetic Dataset. SUNCG [33] is a large-scale synthetic scene dataset that contains different indoor scenes with realistic rooms and furniture layouts that are manually created through the Planner5D platform. It contains rooms and object instances.

In this project, we create a collection of 2D rendering sets. Each 2D rendering set comprises RGB images, depth images, and segmentation maps with the corresponding camera viewpoints. The entire indoor scene point cloud can be obtained by back-projecting the depth images from every viewpoint inside a scene and fusing them together. Specifically, we only keep the rooms that have more than 15 viewpoints and related rendered depth maps. There are 40 object categories in the dataset, including wall, floor, cabinet, bed, chair, sofa, table, door, window, bookshelf, picture, counter, blinds, desk, shelves, curtain, dresser, pillow, mirror, floor_mat, clothes, ceiling, books, refrigerator, television, paper, towel, shower_curtain, box, whiteboard, person, night_stand, toilet, sink, lamp, bathtub, bag, otherstructure, otherfurniture, and otherprop. The generated truncated point cloud data used in our training process and the 2D rendering sets will be released to the public upon acceptance of this paper.
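The back-projection used to assemble the scene point cloud can be sketched as follows, assuming a pinhole model x_cam = R x_world + t (our own convention and function name, not necessarily those of the renderer):

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Back-project a depth image to a world-frame point cloud.

    depth: (H, W) metric depth per pixel
    K:     (3, 3) camera intrinsics
    R, t:  extrinsics with x_cam = R @ x_world + t (R orthonormal),
           so the inverse mapping is x_world = R.T @ (x_cam - t).
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                                  # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)  # z * K^-1 [u, v, 1]
    world = (cam - t) @ R                                      # == (R.T @ (cam - t).T).T
    return world
```

Fusing the outputs of this function over all viewpoints of a room yields the full scene cloud described above.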

S3DIS Real-world Dataset. The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset contains various larger-scale natural indoor environments and is significantly more challenging than other real 3D datasets such as ScanNet [8] and SceneNN [16] datasets. It consists of 3D scan point clouds for indoor areas including a total of rooms. For each room, thousands of viewpoints are provided, including camera poses, 2D RGB images, 2D segmentation maps, and depth images under each specific viewpoint. For semantic segmentation, there are object categories including ceiling, floor, wall, beam, column, window, door, table, chair, bookcase, sofa, board, and clutter.

IV-B Implementation Details

For both the SUNCG and S3DIS datasets, each point is represented as a normalized flat vector (XYZ, RGB). These truncated point clouds are used as training data as well as for calculating the loss with the 2D segmentation map under the same viewpoint. Following the settings in [27], each point is represented as a 9D vector (XYZ, RGB, UVW), where UVW are the normalized spatial coordinates. In testing, the test data are the points of the entire room, as in other fully 3D-supervised methods. For the SUNCG dataset, selected viewpoints are used to truncate point clouds as our training data, with held-out viewpoints as our testing data. For S3DIS, the experimental results are reported by training on the viewpoints of the training data (see details in Section IV-D2) and testing with 6-fold cross-validation over the 6 areas (area 1 - area 6). Our proposed network is trained with the Adam solver on a single GPU, with the base learning rate decayed in steps at fixed intervals of iterations. A connected component algorithm is employed to calculate the boundary of each instance in the ground-truth segmentation map. The performance of semantic segmentation is evaluated by the standard metrics: mean accuracy over all classes (mAcc), mean per-class intersection-over-union (mIoU), and overall accuracy (oAcc).

IV-C Experimental Results

| Supervision | Method | mAcc(%) | mIoU(%) | oAcc(%) |
|---|---|---|---|---|
| 2D | GPFN with DP (Ours) | 61.9 | 45.0 | 73.4 |
| 2D | GPFN with DP w/ OBSNet (Ours) | 71.9 | 61.2 | 84.5 |
| 2D | GPFN with PR w/o OBSNet (Ours) | 65.3 | 50.8 | 79.1 |
| 2D | GPFN with PR w/ OBSNet (Ours) | 87.3 | 70.37 | 91.8 |

TABLE II: Quantitative results of our proposed 2D supervised method on the SUNCG dataset. "DP" indicates Direct Projection, "PR" indicates Perspective Rendering, and "w/"/"w/o" indicate with/without the OBSNet decoder.
| Supervision | Method | mAcc(%) | mIoU(%) | oAcc(%) |
|---|---|---|---|---|
| 3D | PointNet [27] | 66.2 | 47.6 | 78.5 |
| 3D | Engelmann et al. [10] | 66.4 | 49.7 | 81.1 |
| 3D | PointNet++ [28] | 67.1 | 54.5 | 81.0 |
| 3D | DGCNN [40] | - | 56.1 | 84.1 |
| 3D | Engelmann et al. [11] | 67.8 | 58.3 | 84.0 |
| 3D | SPG [19] | 73.0 | 62.1 | 85.5 |
| 2D | GPFN with DP (Ours) | 39.2 | 30.4 | 53.7 |
| 2D | GPFN with DP w/ OBSNet (Ours) | 59.4 | 42.7 | 70.0 |
| 2D | GPFN with PR w/o OBSNet (Ours) | 54.2 | 39.0 | 66.8 |
| 2D | GPFN with PR w/ OBSNet (Ours) | 66.5 | 50.8 | 79.1 |

TABLE III: Quantitative results (without a pretrained model) of our proposed 2D supervised method on the S3DIS dataset, using only 1/6 of the viewpoints in each room for training. Our 2D supervised method achieves results comparable with most of the 3D supervised state-of-the-art methods.

IV-C1 Effectiveness of the Proposed Framework

2D Supervised-GPFN by Direct Projection without OBSNet. Instead of using 3D ground truth labels as supervision, only 2D segmentation maps are adopted for training. The predicted point cloud with labels is re-projected to the image by direct projection according to the camera pose (R, t), and the loss is calculated based on the 2D segmentation maps. Note that point collision may occur when an occluded object is projected to the same image area as a visible object. Not surprisingly, as shown in the first row under "2D Supervision" in Table II and Table III, the performance is low on both datasets: (61.9% mAcc, 45.0% mIoU, and 73.4% oAcc) on SUNCG and (39.2% mAcc, 30.4% mIoU, and 53.7% oAcc) on S3DIS.
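Direct projection itself is standard pinhole geometry. A minimal sketch (our own helper, not the authors' code) that maps predicted points into the 2D segmentation map under pose (R, t) and assumed intrinsics K might look like:

```python
import numpy as np

def direct_project(points_w, R, t, K):
    """Map 3D world points into pixel coordinates under camera pose (R, t)
    and intrinsics K. Returns rounded (u, v) pixels and per-point depths."""
    cam = R @ points_w.T + t.reshape(3, 1)   # world -> camera frame
    z = cam[2]
    uv = (K @ cam)[:2] / z                   # perspective divide
    return np.round(uv.T).astype(int), z

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
pix, depth = direct_project(np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]),
                            np.eye(3), np.zeros(3), K)
# pix -> [[50, 50], [100, 50]]; both points lie at depth 2
```

When two points land on the same pixel (the collision case discussed above), this naive projection keeps no depth ordering, which is exactly the ambiguity the OBSNet is later introduced to resolve.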

2D Supervised-GPFN by Direct Projection with OBSNet.

We conduct experiments by adding the OBSNet decoder while still using direct projection as above. Even though the point collision problem still exists, the spatial relations between visible and occluded objects are distinguished through the OBSNet. This is especially important in 3D scenes containing multiple classes of objects. On the SUNCG dataset, the performance is thus boosted to (71.9% mAcc, 61.2% mIoU, and 84.5% oAcc) (the 2nd row in Table II), and on the S3DIS dataset to (59.4% mAcc, 42.7% mIoU, and 70.0% oAcc) (the 2nd "2D Supervision" row in Table III), which demonstrates the large positive impact of the proposed OBSNet.

2D Supervised-GPFN by Perspective Rendering without OBSNet. We further explore the effectiveness of Perspective Rendering. In this design, we only keep the segmentation decoder and perform semantic fusion when projecting the point cloud to the 2D image plane. In this way, the points inside each single object can be well predicted via fusion. However, for complex scenes with multiple objects, the occlusion issue limits the improvement to (54.2% mAcc, 39.0% mIoU, and 66.8% oAcc) on the S3DIS dataset. For the SUNCG dataset, the environments are more complicated and occlusion occurs more frequently, so semantic fusion cannot contribute as much and the performance is only improved to (65.3% mAcc, 50.8% mIoU, and 79.1% oAcc).

2D Supervised-GPFN by Perspective Rendering with OBSNet. In this experiment, our proposed Perspective Rendering replaces direct projection. Combined with the OBSNet, the predicted point cloud is filtered via the visibility mask, and the points projected to the same grid are then selected by semantic fusion. For the synthetic dataset, since no other method conducts point cloud segmentation on it, we only compare results among different architectures of our proposed GPFN. As shown in Table II, the result is largely improved to (87.3% mAcc, 70.37% mIoU, and 91.8% oAcc) owing to the joint effects of Perspective Rendering and the OBSNet. For the real-world dataset, as shown in Table III, the segmentation results (66.5% mAcc, 50.8% mIoU, and 79.1% oAcc) are significantly improved and even comparable with fully 3D-supervised results.
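A simplified sketch of this pipeline, combining the visibility mask with per-pixel semantic fusion, is shown below. The function and argument names are hypothetical, and we assume fusion sums the per-point class probabilities landing in each pixel before taking the argmax:

```python
import numpy as np

def render_semantics(pixels, probs, visible, hw):
    """Sketch of Perspective Rendering: drop points marked occluded by the
    visibility mask, fuse the class probabilities of all points landing in
    the same pixel, and take the argmax as that pixel's label."""
    h, w = hw
    acc = np.zeros((h, w, probs.shape[1]))
    for (u, v), p in zip(pixels[visible], probs[visible]):
        if 0 <= u < w and 0 <= v < h:
            acc[v, u] += p                 # semantic fusion: sum probabilities per pixel
    label = acc.argmax(-1)
    label[acc.sum(-1) == 0] = -1           # pixels with no projected point
    return label

pixels = np.array([[0, 0], [0, 0], [1, 0]])
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
visible = np.array([True, True, False])
label = render_semantics(pixels, probs, visible, hw=(1, 2))
# label -> [[0, -1]]: fused probabilities pick class 0 at (0, 0); the only
# point at (1, 0) is occluded, so that pixel stays empty
```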

Fig. 6: Qualitative results produced by our proposed method (middle column) on the SUNCG dataset.
Fig. 7: Qualitative results produced by our proposed method on the S3DIS dataset. The first column shows the original point cloud in RGB format. The middle column shows the segmentation results of our proposed 2D weakly supervised method. The last column shows the ground truth segmentation point cloud for comparison. Overall, our method performs well in most scenes. However, when the scale of the scene is very large and the objects are crowded, the spatial relations and occlusions become more complicated, which leads to degraded performance, as in the fourth row with many chairs in the scene.

IV-C2 Comparison with the State-of-the-art Methods

Since there is no previous work on 2D supervised point cloud semantic segmentation for large-scale natural scenes, we compare our proposed framework directly with the state of the art in fully 3D supervised point cloud segmentation. As shown in Table III, using only 2D segmentation maps, our method attains results comparable to most of the 3D supervised methods, and it even outperforms the fully 3D supervised PointNet [27]. The most recent top-performing 3D point cloud segmentation model, SPG [19], still leads by a margin in mean IoU by applying a hierarchical architecture based on SuperPoints. However, the proposed approach achieves competitive mean accuracy and overall accuracy without utilizing contextual relationship reasoning as in SPG.

Figures 6 and 7 visualize several example results of 3D point cloud semantic segmentation generated by our method on SUNCG and S3DIS, respectively. Overall, our proposed 2D supervised semantic segmentation method works well in various kinds of areas and rooms containing multiple classes of objects.

IV-D Ablation Study

In this section, we conduct a set of experiments to explore the effects of different encoder designs and various amounts of training data, as well as the accuracy of the visibility detection by OBSNet.

IV-D1 Encoder Design

| K-NN Graph | Pyramid | mAcc(%) | mIoU(%) | oAcc(%) |
|---|---|---|---|---|
| × | × | 61.3 | 45.1 | 72.6 |
| ✓ | × | 65.1 | 48.6 | 78.4 |
| × | ✓ | 63.5 | 46.4 | 75.3 |
| ✓ | ✓ | 66.5 | 50.8 | 79.1 |

TABLE IV: Effects of encoder structures on the S3DIS dataset.

Our GPFN encoder network integrates the K-NN graph structure and the pyramid design. Here we conduct experiments to verify the effectiveness of these two designs for 3D point cloud semantic segmentation, still using 2D segmentation maps as the supervision signal. As shown in Table IV, without either design or extra training data, a simple PointNet-like model [27] achieves (61.3% mAcc, 45.1% mIoU, and 72.6% oAcc) on the S3DIS dataset, which shows the limited semantic encoding capability of such a simple network. Adding the K-NN graph structure to the encoder boosts the performance to (65.1% mAcc, 48.6% mIoU, and 78.4% oAcc), which demonstrates the benefit of using graph convolution to encode sparse point cloud data. Through the K-NN edge graph convolution, edge features are extracted and aggregated to the central point, which improves per-point classification accuracy and compensates for using only a 2D supervision signal. Applying the pyramid design, which concatenates both low-level and high-level global features, increases the performance to (63.5% mAcc, 46.4% mIoU, and 75.3% oAcc). With both the K-NN graph structure and the pyramid design, the performance is boosted to (66.5% mAcc, 50.8% mIoU, and 79.1% oAcc). This shows that, by integrating the K-NN graph structure and the pyramid design, the network encodes more semantic and context information and achieves better segmentation results.
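To illustrate the aggregation described above, here is a small NumPy sketch of K-NN edge features in the spirit of EdgeConv [40]. It omits the learned weights: the real encoder applies MLPs to the edge features before pooling, so this only shows the neighborhood construction and max aggregation:

```python
import numpy as np

def knn_edge_features(x, k):
    """EdgeConv-style neighborhood features: for each point, find its k
    nearest neighbors, build edge features [x_i, x_j - x_i], then max-pool
    over neighbors so edge information is aggregated to the central point."""
    d = np.linalg.norm(x[:, None] - x[None], axis=-1)      # pairwise distances
    idx = np.argsort(d, axis=1)[:, 1:k + 1]                # k nearest (skip self)
    center = np.repeat(x[:, None], k, axis=1)              # x_i, repeated per neighbor
    edges = np.concatenate([center, x[idx] - center], -1)  # [x_i, x_j - x_i]
    return edges.max(axis=1)                               # aggregate to the center

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
feat = knn_edge_features(x, k=2)
# feat[0] -> [0., 0., 1., 1.]: point 0's neighbors are points 1 and 2
```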

| Training data | mAcc(%) | mIoU(%) | oAcc(%) |
|---|---|---|---|
| All | 67.0 | 52.5 | 81.5 |
| 1/2 | 66.9 | 51.8 | 80.9 |
| 1/4 | 66.7 | 50.9 | 79.5 |
| 1/6 | 66.5 | 50.8 | 79.1 |
| 1/12 | 56.5 | 39.3 | 66.2 |
| 1/20 | 37.8 | 29.1 | 40.0 |

TABLE V: Performance comparison of using different amounts of training data on the S3DIS dataset.
Fig. 8: Comparison of the segmentation results for several scenes tested on the S3DIS dataset. PCL indicates the point cloud. The first row shows the corresponding 2D RGB images (for visualization only, not used in our framework) under a specific viewpoint. The second row shows our truncated point cloud, which is fed as the input of our network. The third row demonstrates the output of the OBSNet under the same viewpoint; the point cloud is rotated for better visualization, with blue points marking the visible parts and red points the occluded parts (third row only). The 4th and 5th rows compare the segmentation results with and without the OBSNet. The last row shows the ground truth segmentation of the 3D point cloud.

IV-D2 Amount of Training Data

The scene point clouds of the S3DIS dataset are constructed from thousands of viewpoints. Here the robustness of the proposed point cloud segmentation network is evaluated with different amounts of training data. Table V reports the performance with various data proportions (all, 1/2, 1/4, 1/6, 1/12, and 1/20 of the viewpoints, evenly and randomly selected in each room). There is no significant difference between using 1/2, 1/4, or 1/6 of all viewpoints, and using the full set of training data only boosts the performance a little. However, the performance drops significantly when using only 1/12 or 1/20 of the viewpoints: with so few viewpoints, some objects may be missed entirely, and occluded objects are unlikely to become visible from another viewpoint. To balance the trade-off between efficiency and accuracy, 1/6 of the data is adopted for all other experiments.
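The even random selection per room can be sketched as follows; this is a hypothetical helper, as the paper does not specify the exact sampling procedure:

```python
import random

def sample_viewpoints(viewpoints_per_room, fraction, seed=0):
    """Randomly keep a fraction of viewpoints in every room, so each room
    retains coverage even at low data budgets (e.g. fraction = 1/6)."""
    rng = random.Random(seed)  # fixed seed for reproducible splits
    kept = {}
    for room, vps in viewpoints_per_room.items():
        n = max(1, round(len(vps) * fraction))   # keep at least one view per room
        kept[room] = rng.sample(vps, n)
    return kept

kept = sample_viewpoints({"office_1": list(range(12)),
                          "office_2": list(range(6))}, fraction=1 / 6)
# keeps 2 of 12 viewpoints in office_1 and 1 of 6 in office_2
```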

| Training data | All | 1/2 | 1/4 | 1/6 | 1/12 | 1/20 |
|---|---|---|---|---|---|---|
| S3DIS accuracy (%) | 93.0 | 92.6 | 91.7 | 91.2 | 89.6 | 85.0 |

TABLE VI: Accuracy of visibility detection by our proposed OBSNet using different amounts of training data on the S3DIS dataset.

IV-D3 Visibility Detection by OBSNet

As a binary classifier, the OBSNet achieves over 90% accuracy for visibility detection. We train the OBSNet with visibility labels generated by the distance filter and quantitatively evaluate our models on the S3DIS dataset, with the corresponding results reported in Table VI. Given the truncated point cloud as input, the OBSNet classifies each point as "visible" or "occluded". Following the training data settings in Section IV-D2 for a fair comparison, we report the testing performance of the OBSNet with different amounts of training data. As shown in Table VI, there is only a 1.8% performance gap between using all and 1/6 of the training data. Even with only 1/20 of the data, the proposed model still achieves 85.0% classification accuracy. This supports our observation that the point clouds from different viewpoints within a room overlap considerably with each other, so reducing the training data does not significantly decrease the accuracy. The results show that the OBSNet is notably robust to varying amounts of point cloud data.
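The distance filter that produces these training labels can be sketched as a z-buffer-style test; this is a hypothetical reimplementation, and the eps depth tolerance is our own assumption:

```python
import numpy as np

def distance_filter_labels(pixels, depth, eps=0.1):
    """Among points projected into the same pixel, mark the nearest one
    (within eps) as visible and the rest as occluded. These binary labels
    can then supervise a visibility classifier such as the OBSNet."""
    nearest = {}                                 # pixel -> index of closest point
    for i, ((u, v), z) in enumerate(zip(pixels, depth)):
        if (u, v) not in nearest or z < depth[nearest[(u, v)]]:
            nearest[(u, v)] = i
    visible = np.zeros(len(depth), dtype=bool)
    for i, ((u, v), z) in enumerate(zip(pixels, depth)):
        visible[i] = z <= depth[nearest[(u, v)]] + eps
    return visible

visible = distance_filter_labels([(0, 0), (0, 0), (1, 1)],
                                 np.array([1.0, 3.0, 2.0]))
# visible -> [True, False, True]: the farther point at pixel (0, 0) is occluded
```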

The effectiveness of the OBSNet is demonstrated in Figure 8. As shown in the third row, the OBSNet output marks the occluded parts as red points and the visible parts as blue points. Through the comparison of the fourth and fifth rows, we observe that the OBSNet successfully separates the visible and occluded objects and improves the segmentation performance. As shown in the first, second, and fourth columns, occluded parts such as the floor are correctly segmented with the OBSNet. Also, in the fourth column, the lights on the ceiling are correctly separated thanks to the visibility detection by the OBSNet.

| Training data | mAcc(%) | mIoU(%) | oAcc(%) |
|---|---|---|---|
| Trained from scratch on S3DIS | 66.5 | 50.8 | 79.1 |
| Pretrained on SUNCG | 67.0 | 53.5 | 81.3 |

TABLE VII: Transfer learning from the SUNCG synthetic dataset to the S3DIS real-world dataset. The first row shows results on S3DIS trained from scratch without any pretrained model; the second row shows results finetuned on S3DIS from a model pretrained on SUNCG.

IV-E Generalization from Synthetic to Real-world

Since we are the first to explore semantic point cloud segmentation on the SUNCG dataset, there are no other methods to compare against. We therefore explore domain transfer from synthetic data to real-world data to verify the generalization capability of our proposed model.

First, we pre-train our network on the SUNCG segmentation dataset for 150 epochs. The trained features are then finetuned on the S3DIS training set. As shown in Table VII, when trained on the S3DIS dataset from scratch, our model achieves (66.5% mAcc, 50.8% mIoU, and 79.1% oAcc). With the model pre-trained on SUNCG, the performance on S3DIS is boosted to (67.0% mAcc, 53.5% mIoU, and 81.3% oAcc). The consistent improvement demonstrates the generalization capability of our proposed model on real data.

V Conclusion

In this paper, we have proposed a novel deep graph convolutional model for large-scale semantic scene segmentation in 3D point clouds of wild scenes with only 2D supervision. Combined with the proposed OBSNet and Perspective Rendering, our method can effectively obtain semantic segmentation maps of 3D point clouds for both synthetic and real-world scenes. Different from numerous multi-view 2D-supervised methods focusing only on single object point clouds, our method can handle large-scale wild scenes with multiple objects and achieves encouraging performance, even with only a single view per sample. Inferring the occluded part of a point cloud is also the core requirement of the 3D completion task; with the semantic information and the spatial relations between objects in the scene, scene point cloud reconstruction and completion can benefit from our method. Future directions include unifying the point cloud completion and segmentation tasks for natural scene point clouds.


  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas (2018) Learning representations and generative models for 3d point clouds. In ICML, Cited by: §II-A.
  • [2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. K. Brilakis, M. A. Fischer, and S. Savarese (2016) 3D Semantic Parsing of Large-Scale Indoor Spaces. CVPR, pp. 1534–1543. Cited by: 4th item, §IV-A.
  • [3] M. Bithell and W. D. Macmillan (2007) Escape from the cell: spatially explicit modelling with and without grids. In International Journal on Ecological Modelling and Systems Ecology, Cited by: §II-A.
  • [4] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2016) Generative and discriminative voxel modeling with convolutional neural networks. ArXiv abs/1608.04236. Cited by: §II-A.
  • [5] R. Chen, S. Han, J. Xu, and H. Su (2019) Point-based multi-view stereo network. ICCV abs/1908.04422. Cited by: §II-C.
  • [6] P. A. Chou, M. Koroteev, and M. Krivokuca (2019) A volumetric approach to point cloud compression—part i: attribute compression. IEEE Transactions on Image Processing 29, pp. 2203–2216. Cited by: §II-A.
  • [7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner (2017) ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. CVPR, pp. 2432–2443. Cited by: §II-A.
  • [8] A. Dai, D. Ritchie, M. Bokeloh, S. E. Reed, J. Sturm, and M. Nießner (2018) ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans. CVPR, pp. 4578–4587. Cited by: §II-B, §IV-A.
  • [9] X. Ding, W. Lin, Z. Chen, and X. Zhang (2019) Point cloud saliency detection by local and global feature fusion. IEEE Transactions on Image Processing 28, pp. 5379–5393. Cited by: §II-A.
  • [10] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe (2017) Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds. ICCVW, pp. 716–724. Cited by: §II-B, TABLE III.
  • [11] F. Engelmann, T. Kontogianni, J. Schult, and B. Leibe (2018) Know What Your Neighbors Do: 3D Semantic Segmentation of Point Clouds. In ECCV Workshops, Cited by: §II-B, TABLE III.
  • [12] D. C. Garcia, T. A. da Fonseca, R. U. Ferreira, and R. L. de Queiroz (2019) Geometry coding for dynamic voxelized point clouds using octrees and multiple contexts. IEEE Transactions on Image Processing 29, pp. 313–322. Cited by: §II-A.
  • [13] J. Guerry, A. Boulch, B. L. Saux, J. Moras, A. Plyer, and D. Filliat (2017) SnapNet-R: Consistent 3D Multi-view Semantic Labeling for Robotics. ICCVW, pp. 669–678. Cited by: §II-A.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. ICCV, pp. 2980–2988. Cited by: §II-C.
  • [15] W. Hu, Z. Fu, and Z. Guo (2019) Local frequency interpretation and non-local self-similarity on graph for point cloud inpainting. IEEE Transactions on Image Processing 28, pp. 4087–4100. Cited by: §II-A.
  • [16] B. Hua, Q. Pham, D. T. Nguyen, M. Tran, L. Yu, and S. Yeung (2016) SceneNN: A Scene Meshes Dataset with aNNotations. 3DV, pp. 92–101. Cited by: §IV-A.
  • [17] E. Insafutdinov and A. Dosovitskiy (2018) Unsupervised Learning of Shape and Pose with Differentiable Point Clouds. In NeurIPS, Cited by: §I.
  • [18] M. Krivokuca, P. A. Chou, and M. Koroteev (2019) A volumetric approach to point cloud compression–part ii: geometry compression. IEEE Transactions on Image Processing 29, pp. 2217–2229. Cited by: §II-A.
  • [19] L. Landrieu and M. Simonovsky (2018) Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs. CVPR, pp. 4558–4567. Cited by: §II-B, §IV-C2, TABLE III.
  • [20] Y. Liao, Y. Yang, and Y. F. Wang (2018) 3D Shape Reconstruction from a Single 2D Image via 2D-3D Self-Consistency. CoRR abs/1811.12016. Cited by: §I.
  • [21] C. Lin, C. Kong, and S. Lucey (2018) Learning Efficient Point Cloud Generation for Dense 3D Object Reconstruction. In AAAI, Cited by: §I, §II-C, §II-C.
  • [22] P. Mandikal, L. NavaneetK., M. Agarwal, and V. B. Radhakrishnan (2018) 3D-LMNet: Latent Embedding Matching for Accurate and Diverse 3D Point Cloud Reconstruction from a Single Image. In BMVC, Cited by: §II-A.
  • [23] D. Maturana and S. A. Scherer (2015) VoxNet: a 3d convolutional neural network for real-time object recognition. 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §II-A.
  • [24] H. Meng, L. Gao, Y. Lai, and D. Manocha (2018) VV-Net: Voxel VAE net with group convolutions for point cloud segmentation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8499–8507. Cited by: §II-B.
  • [25] L. NavaneetK, P. Mandikal, M. Agarwal, and R. V. Babu (2019) CAPNet: Continuous Approximation Projection For 3D Point Cloud Reconstruction Using 2D Supervision. CoRR abs/1811.11731. Cited by: §II-C.
  • [26] F. Pittaluga, S. J. Koppal, S. B. Kang, and S. N. Sinha (2019) Revealing scenes by inverting structure from motion reconstructions. In CVPR, Cited by: §II-C.
  • [27] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR, Cited by: §I, §II-A, §II-A, §II-B, §III-A, §III-A, §IV-B, §IV-C2, §IV-D1, TABLE III.
  • [28] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In NIPS, Cited by: §I, §II-A, §II-B, §III-A, TABLE III.
  • [29] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2017) Frustum PointNets for 3D object detection from RGB-D data. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 918–927. Cited by: §II-A.
  • [30] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and Multi-view CNNs for Object Classification on 3D Data. CVPR, pp. 5648–5656. Cited by: §II-A.
  • [31] G. Riegler, A. O. Ulusoy, and A. Geiger (2016) OctNet: learning deep 3d representations at high resolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6620–6629. Cited by: §II-A.
  • [32] B. Shi, S. Bai, Z. Zhou, and X. Bai (2015) DeepPano: Deep Panoramic Representation for 3-D Shape Recognition. IEEE Signal Processing Letters 22, pp. 2339–2343. Cited by: §II-A.
  • [33] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A. Funkhouser (2017) Semantic Scene Completion from a Single Depth Image. CVPR, pp. 190–198. Cited by: 4th item, §II-B, §IV-A, §IV-A.
  • [34] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller (2015) Multi-view Convolutional Neural Networks for 3D Shape Recognition. ICCV, pp. 945–953. Cited by: §II-A.
  • [35] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and S. Savarese (2017) SEGCloud: semantic segmentation of 3d point clouds. 2017 International Conference on 3D Vision (3DV), pp. 537–547. Cited by: §II-B.
  • [36] G. Te, W. Hu, A. Zheng, and Z. Guo (2018) RGCNN: Regularized Graph CNN for Point Cloud Segmentation. In ACM Multimedia, Cited by: §II-A.
  • [37] B. H. Wang, W. Chao, Y. Wang, B. Hariharan, K. Q. Weinberger, and M. E. Campbell (2019) LDLS: 3-d object segmentation through label diffusion from 2-d images. IEEE Robotics and Automation Letters 4, pp. 2902–2909. Cited by: §II-C.
  • [38] H. Wang, X. Rong, L. Yang, S. Wang, and Y. Tian (2019) Towards Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes. In BMVC, Cited by: §I, §II-C, §II-C, §III-C, §III-C.
  • [39] X. Wang, S. Liu, X. Shen, C. Shen, and J. Jia (2019) Associatively Segmenting Instances and Semantics in Point Clouds. CoRR abs/1902.09852. Cited by: §II-B.
  • [40] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2018) Dynamic Graph CNN for Learning on Point Clouds. CoRR abs/1801.07829. Cited by: §II-A, §III-A, §III-B, TABLE III.
  • [41] Y. Yang, C. Feng, Y. Shen, and D. Tian (2017) FoldingNet: point cloud auto-encoder via deep grid deformation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 206–215. Cited by: §II-A.
  • [42] X. Ye, J. Li, H. Huang, L. Du, and X. Zhang (2018) 3D Recurrent Neural Networks with Context Fusion for Point Cloud Semantic Segmentation. In ECCV, Cited by: §II-A.
  • [43] W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert (2018) PCN: Point Completion Network. In 3DV, Cited by: §II-A.