Bi-Directional Attention for Joint Instance and Semantic Segmentation in Point Clouds

03/11/2020 ∙ by Guangnan Wu, et al. ∙ 0

Instance segmentation in point clouds is one of the most fine-grained ways to understand a 3D scene. Due to its close relationship to semantic segmentation, many works approach these two tasks simultaneously and leverage the benefits of multi-task learning. However, most of them only consider simple strategies such as element-wise feature fusion, which may not lead to mutual promotion. In this work, we build a Bi-Directional Attention module on backbone neural networks for 3D point cloud perception. The module uses a similarity matrix measured from the features of one task to help aggregate non-local information for the other task, avoiding potential feature exclusion and task conflict. Comprehensive experiments and ablation studies on the S3DIS and PartNet datasets verify the superiority of our method. Moreover, we analyze the mechanism by which the Bi-Directional Attention module helps joint instance and semantic segmentation.




1 Introduction

Among the tasks of computer vision, instance segmentation is one of the most challenging, since it requires understanding and perceiving the scene at both the unit and the instance level. Notably, the vast demand for machines that interact with real scenarios, such as robotics and autonomous driving [21, 13], has made instance segmentation in 3D scenes a hot research topic.

Though much progress has been made, 3D instance segmentation still lags far behind its 2D counterpart [23, 9, 17, 4, 3, 5]. Unlike the 2D image, the 3D scene can be represented by many forms, such as multi-view projection images, volumes, and point clouds. Generally speaking, the form of multi-view projection images makes compromises to utilize mature techniques such as 2D CNN [31, 25, 30, 8, 32] but will lose some critical information such as 3D geometry. As for representing the 3D scene as volumes [40, 19, 29, 34], it simplifies the task but will lead to expensive computation and memory cost, making them impractical for complex scenarios. In contrast, point clouds could represent a 3D scene more compactly and intuitively, and thus became more popular and drew more attention recently. The proposed PointNet [24] and some following works [26, 12, 38, 15, 11, 16, 42, 28, 39] could process the raw point clouds directly, achieving remarkable performance on 3D classification and part segmentation tasks. The success brings the prospect for more fine-grained perception tasks in 3D point clouds, such as instance segmentation.

Instance segmentation in point clouds requires determining both the category and the instance that each point belongs to. The most direct way is to further regress the bounding box of each instance on top of the semantic segmentation task, as in [10, 43, 41]. This kind of method is usually referred to as proposal-based instance segmentation. It is straightforward, but a bounding box sometimes contains multiple objects or just a part of an object, making it hard to delineate the instance precisely. For this reason, proposal-free instance segmentation is more popular. Moreover, due to the close relationship between instance segmentation and semantic segmentation, most recent works approach these two tasks simultaneously and use deep neural networks with two sub-branches, one for each task [37, 22, 45]. Among them, many adopt a feature fusion strategy that lets the features for one task promote the other task. However, the features of the two tasks are not completely compatible with each other. While points belonging to different semantic classes must belong to different instances, points in different instances do not necessarily belong to different semantic classes. Directly concatenating or adding these two kinds of features in the model may therefore lead to task conflict.

In fact, with simple element-wise feature fusion such as concatenation and addition, only semantic features can reliably help distinguish instances in all cases. We discuss the details in Sec. 2 and Sec. 3. This situation poses a question: do we still need instance features for semantic segmentation, and how can the two tasks be made to promote each other? In this work, we investigate another way to incorporate features for semantic and instance segmentation. Instead of explicitly fusing features, we use the similarity information implied in the features for one task to assist the other task. Specifically, we first measure pair-wise similarity on semantic features to form the semantic similarity matrix, with which we propagate instance features. The propagation operation computes the response at a point as a weighted sum of the features at all points, with semantic similarity as the weight. Finally, the responses are concatenated to the original instance features for instance segmentation. The same steps are also conducted in the other direction: an instance similarity matrix is computed to propagate semantic features for semantic segmentation. The propagation operation aggregates non-local information and is also referred to as attention [36, 33, 44, 7]. Therefore, we name this module Bi-Directional Attention and call our networks BAN.

Figure 1: Instance and semantic segmentation in point clouds using BAN. (a) Results on the S3DIS dataset, (b) Results on the PartNet dataset.

Aggregating non-local information helps in the following aspects. First, for attention applied to instance features for instance segmentation, the semantic similarity matrix helps push apart instance features belonging to different semantic classes. Though it also pulls together instance features belonging to the same semantic class, the concatenated original instance features still keep different instances distinguishable. Second, for attention applied to semantic features for semantic segmentation, the instance similarity matrix makes the semantics within each instance more consistent, thus improving the delineation of details. In addition to these positive effects in the forward pass, the attention operation is also beneficial for back-propagating uniform gradients within the same semantic class or instance. Consequently, our Bi-Directional Attention module can aggregate the features more properly and avoid potential task conflict. We compare our BAN to state-of-the-art methods on prevalent 3D point cloud datasets, including S3DIS [1] and PartNet [20]. We show two instance and semantic segmentation results in Fig. 1. In experiments, our method demonstrates consistent superiority on most of the evaluation metrics. Moreover, we conduct detailed ablation and mechanism studies, which suggest that the similarity matrices truly reflect the required pair-wise semantic and instance similarities. With the attention operations from the two directions applied sequentially, we reach the best performance. Our code has been open sourced.

2 Related Works

In this section, we revisit the works most relevant to instance segmentation in point clouds. These works can be divided into two general types: proposal-based and proposal-free.

2.1 Proposal-based methods

Most proposal-based methods for point clouds follow the scheme of Mask R-CNN [9] for 2D images, which formulates instance segmentation as joint object bounding box regression and semantic segmentation. 3D-SIS [10] and GSPN [43] rely on anchors and two-stage training, and spend additional time pruning the dense object proposals. BoNet [41] directly regresses bounding boxes without anchors. However, only global features are used to regress rough instance boxes.

2.2 Proposal-free methods

Proposal-free methods directly produce representations to estimate the semantic categories and cluster the instance groups. SGPN [35] learns a similarity matrix to group instances and treats semantic segmentation as a standalone task. 3D-BEVIS [6] obtains additional instance features from a bird's-eye view, but still considers semantic segmentation to be independent of instance segmentation.

In view of the close relationship between instance and semantic segmentation, many works have started to study how to incorporate the features of the two tasks efficiently for mutual benefit. JSIS3D [22] uses a multi-value conditional random field to fuse semantic and instance information, but it requires some approximation to optimize. ASIS [37] fuses semantic features into instance features by element-wise addition to help distinguish instances of different semantic classes. Besides, KNN is used to assemble more instance features from the neighborhood of each point and make the assembled feature more robust, but it is non-differentiable and breaks the back-propagation chain. The use of KNN in that work can be considered a prototype of the non-local operation.

The most recent work, JSNet [45], fuses semantic and instance features into each other by simple aggregation strategies such as element-wise addition and concatenation. In this way, the problem can be formalized as the following equations:

$$C_i = f_{sem}\big(H(F_{sem}^{i}, F_{ins}^{i})\big), \qquad G_i = f_{ins}\big(H(F_{sem}^{i}, F_{ins}^{i})\big)$$

where $F_{sem}^{i}$ and $F_{ins}^{i}$ represent the semantic and instance features of point $i$, respectively, and $C_i$ and $G_i$ are the semantic category and instance group of point $i$. $H$ is some simple feature aggregation method. We use $f_{sem}$ and $f_{ins}$ to represent the mapping functions for semantic and instance segmentation, respectively.

Ideally, there are three cases for two points $i$ and $j$: (1) $C_i = C_j$ and $G_i \neq G_j$; (2) $C_i = C_j$ and $G_i = G_j$; (3) $C_i \neq C_j$ and $G_i \neq G_j$. In the first case, for semantic segmentation $f_{sem}$, aggregating $F_{sem}$ and $F_{ins}$ by $H$ will push the responses of points $i$ and $j$ far apart, because their instance features differ. Thus $C_i$ and $C_j$ are hard to keep consistent, which is contrary to the case setting. In the second case, both $f_{sem}$ and $f_{ins}$ can be promoted by aggregating features of the same instance by $H$. The third case need not be considered when aggregating features, because points $i$ and $j$ are related neither in semantics nor in instance. So, with the simple aggregation strategy adopted by JSNet [45], there is a potential risk of task conflict in some specific cases.
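The first case can be made concrete with a toy example. The following is a minimal sketch, assuming element-wise addition as the aggregation method and hand-picked 2-D feature values (the values and the helper names `add_fuse` and `distance` are hypothetical, chosen only for illustration). Two points share a semantic class but belong to different instances; after fusion their features are no longer identical.

```python
import math

def add_fuse(sem, ins):
    """Element-wise addition of a semantic and an instance feature vector."""
    return [s + i for s, i in zip(sem, ins)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points of the same semantic class but different instances
# (hypothetical feature values chosen for illustration).
sem_i, sem_j = [1.0, 0.0], [1.0, 0.0]    # identical semantic features
ins_i, ins_j = [0.0, 2.0], [0.0, -2.0]   # well-separated instance features

gap_before = distance(sem_i, sem_j)
gap_after = distance(add_fuse(sem_i, ins_i), add_fuse(sem_j, ins_j))
print(gap_before, gap_after)
```

In this toy setting the fused features of the two points are separated by exactly the gap between their instance features, so a semantic classifier operating on the fused features has to undo that separation.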

In summary, though the non-local operation and the feature aggregation strategy have demonstrated certain advantages, current implementations have some crucial problems. Considering this, we investigate a proper non-local feature aggregation method in this work.

3 Methodology and Implementation

3.1 Methodology

Directly adding or concatenating semantic and instance features for semantic segmentation may raise some problems as discussed in Sec. 2. However, the similarity information implied in the instance features would help semantic segmentation without any harm. Here we propose a way to use similarity information.

We adjust each point's semantic feature to be the weighted sum of the semantic features of points belonging to the same instance (i.e., points with similar instance features). This makes the semantic features robust and consistent within each instance, which improves the delineation of details. To enable this and take advantage of the similarity information in the instance features, we design the aggregation operation as:

$$Y = \big[\, S\, g(P),\; P \,\big], \qquad S = \mathrm{softmax}\big(\theta(X)\, \phi(X)^{T}\big) \tag{2}$$

where $X$ and $P$ represent two kinds of features of size $N \times N_X$ and $N \times N_P$, respectively ($N$ is the number of points and $N_X$ is the number of channels of feature $X$), and $[\cdot, \cdot]$ denotes concatenation with the original features. $\theta$, $\phi$, and $g$ are functions that re-weight and sum values along the feature dimension with learned weights. We measure similarities by the inner product of $\theta(X)$ and $\phi(X)$, which results in a matrix of size $N \times N$. We further apply softmax to each row to obtain a transition matrix, which is our final similarity matrix $S$.

When $X$ is the instance features and $P$ is the semantic features, this operation propagates semantic features to other points through the instance similarity matrix, and the adjusted semantic features are more uniform within each instance than the original $P$. Since there is no explicit element-wise adding or concatenating of the two kinds of features, using the final aggregation result for semantic segmentation does not suffer from the problem mentioned in the last section. Besides, this aggregation operation is naturally non-local. For these reasons, we also use it to fuse semantic features for instance segmentation. In other words, we conduct another aggregation operation with $X$ as the semantic features and $P$ as the instance features for instance segmentation. Consequently, our method has two attention operations with different data flow directions, which together we name the Bi-Directional Attention module.

It is worth noting that the aggregation operation defined above has a form similar to the attention operation in [36], but ours takes two kinds of inputs for joint instance and semantic segmentation in point clouds. The architecture of our attention (aggregation) operation is illustrated in Fig. 2.
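The aggregation operation described in this section can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: it assumes the re-weighting functions are plain linear maps (identity when no weight matrix is given), and the names `attention`, `w_theta`, `w_phi`, and `w_g` are hypothetical. Similarities are measured on one feature matrix, normalized by a row-wise softmax, used to propagate the other feature matrix, and the response is concatenated to the original features.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x, p, w_theta=None, w_phi=None, w_g=None):
    """Propagate features p using similarities measured on features x.

    x is N x Cx and p is N x Cp. The optional weight matrices stand in for
    the learned re-weighting functions (assumption: linear maps; identity
    if None). Returns the fused features and the N x N similarity matrix.
    """
    theta = x if w_theta is None else matmul(x, w_theta)
    phi = x if w_phi is None else matmul(x, w_phi)
    g = p if w_g is None else matmul(p, w_g)
    sim = softmax_rows(matmul(theta, transpose(phi)))   # each row sums to 1
    response = matmul(sim, g)                           # weighted sum over all points
    fused = [r + orig for r, orig in zip(response, p)]  # concatenate with original p
    return fused, sim

# Toy input: points 0 and 1 share semantic features, point 2 differs.
sem_feat = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
ins_feat = [[1.0], [3.0], [5.0]]
fused, sim = attention(sem_feat, ins_feat)   # semantic similarity propagates instance features
print(fused)
```

With these toy inputs, the response for point 0 is close to the average of the instance features of points 0 and 1, while the concatenated original value still tells the two points apart.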

Figure 2: Attention operation.

3.2 Implementation

3.2.1 Networks

By connecting the Bi-Directional Attention module to the end of the feature-extracting backbone, we obtain the Bi-Directional Attention Networks (BAN), which use two attention operations to transmit and aggregate information between the instance branch and the semantic branch. The full pipeline of our networks is illustrated in Fig. 3.

Our BAN is composed of a shared encoder and two parallel decoders that produce representations for estimating the semantic categories and clustering the instance groups. Specifically, our backbone is PointNet++ [26]. Given an input point cloud, the backbone first extracts and encodes it into a feature matrix, which is further decoded into a semantic feature matrix and an instance feature matrix, each with one row per point.

The Bi-Directional Attention module takes these two feature matrices as input and conducts two attention operations as defined by Eq. 2. We name the attention operation that computes the semantic similarity matrix and applies it to instance features for instance segmentation STOI, and the attention operation that computes the instance similarity matrix and applies it to semantic features for semantic segmentation ITOS. The output of STOI is passed to some simple fully connected (FC) layers to produce the instance embedding space (one embedding vector per point), while the output of ITOS is passed to some simple FC layers to give the semantic prediction (one score per category for each point). To obtain the instance groups, we cluster the produced instance embeddings with the mean-shift method [2].
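The final clustering step can be illustrated with a small flat-kernel mean-shift routine. This is a sketch under stated assumptions, not the clustering code used in the experiments: the flat kernel, the mode-merging rule (modes closer than half the bandwidth are merged), and the toy embeddings and bandwidth are all choices made here for illustration.

```python
import math

def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Flat-kernel mean shift. Returns one integer cluster label per point."""
    modes = [list(p) for p in points]
    for _ in range(n_iter):
        moved = 0.0
        new_modes = []
        for m in modes:
            # Shift each mode to the mean of the points inside its window.
            nbrs = [p for p in points if math.dist(m, p) <= bandwidth]
            centre = [sum(c) / len(nbrs) for c in zip(*nbrs)]
            moved = max(moved, math.dist(m, centre))
            new_modes.append(centre)
        modes = new_modes
        if moved < tol:
            break
    # Merge modes that converged to (almost) the same location.
    centres, labels = [], []
    for m in modes:
        for k, c in enumerate(centres):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(k)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels

# Toy instance embeddings forming two groups (hypothetical values).
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
labels = mean_shift(embeddings, bandwidth=1.0)
print(labels)
```

Mean shift does not need the number of instances in advance, which suits scenes where the instance count varies from block to block; the bandwidth controls how far apart two embeddings may be while still being grouped together.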

There are three possible orders in which to conduct STOI and ITOS: STOI first, ITOS first, or both simultaneously. Here we use STOI first, because we use a point-wise cross-entropy loss for semantic segmentation and a discriminative loss for instance segmentation, and we believe semantic features will converge faster than instance features. Semantic features will therefore give the instance segmentation task more help at the beginning of training. This assumption is verified in our ablation study in Sec. 5.1.

Figure 3: The pipeline of proposed Bi-Directional Attention Networks (BAN).

3.2.2 Loss Function

Our loss function $L$ has two parts, the semantic segmentation loss $L_{sem}$ and the instance segmentation loss $L_{ins}$, which are optimized at the same time:

$$L = L_{sem} + L_{ins}$$

We use the classical cross-entropy loss for $L_{sem}$, and choose the discriminative loss function proposed for 2D images in [5] as $L_{ins}$. The discriminative loss has been extended to 3D point clouds and used by many works [37, 22, 45]. $L_{ins}$ penalizes the grouping of points across different instances and brings points belonging to the same instance closer in the embedding space. For the detailed definition of the loss function, please check the supplementary material.
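For readers without the supplementary material, the discriminative loss of [5] can be sketched as follows. This is an illustration based on our reading of [5], not the exact training loss used here: it combines a pull term (points are drawn to within a margin of their instance centre), a push term (instance centres are kept at least twice a second margin apart), and a small regularizer on the centre norms. The function name and the default margins and weights below are assumptions and should be checked against [5].

```python
import math

def discriminative_loss(emb, inst, delta_v=0.5, delta_d=1.5,
                        alpha=1.0, beta=1.0, gamma=0.001):
    """Discriminative loss over per-point embeddings and instance ids.

    emb: list of embedding vectors; inst: list of instance ids.
    Margins and weights are assumed defaults, not values from this paper.
    """
    ids = sorted(set(inst))
    groups = {k: [e for e, i in zip(emb, inst) if i == k] for k in ids}
    centres = {k: [sum(c) / len(g) for c in zip(*g)] for k, g in groups.items()}

    # Pull term: hinge on the distance of each point to its own centre.
    l_var = sum(
        sum(max(0.0, math.dist(e, centres[k]) - delta_v) ** 2 for e in g) / len(g)
        for k, g in groups.items()
    ) / len(ids)

    # Push term: hinge on the distance between every pair of centres.
    l_dist = 0.0
    if len(ids) > 1:
        for a in ids:
            for b in ids:
                if a != b:
                    gap = math.dist(centres[a], centres[b])
                    l_dist += max(0.0, 2 * delta_d - gap) ** 2
        l_dist /= len(ids) * (len(ids) - 1)

    # Regularizer: keep centres close to the origin.
    l_reg = sum(math.sqrt(sum(v * v for v in centres[k])) for k in ids) / len(ids)
    return alpha * l_var + beta * l_dist + gamma * l_reg

emb = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]]
good = discriminative_loss(emb, [0, 0, 1, 1], gamma=0.0)   # tight, well-separated
bad = discriminative_loss(emb, [0, 1, 0, 1], gamma=0.0)    # instances mixed up
print(good, bad)
```

When the embeddings already form tight, well-separated groups, both hinge terms vanish; a labeling that mixes the groups produces a large loss.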

3.2.3 Derivative Analysis

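The above sections have explained how our Bi-Directional Attention module helps in the forward pass. Here we further analyze back-propagation through Eq. 2. To simplify the problem, we first give a simplified version of Eq. 2 without softmax, re-weighting functions, and concatenation of the original features:

$$Y = X X^{T} P$$

where $X$ is the feature on which similarities are measured, $P$ is the feature being propagated, and $Y$ is the output of the simplified attention operation. In this case, the derivatives with respect to features $X$ and $P$ are:

$$\frac{\partial\, \mathrm{vec}(Y)}{\partial\, \mathrm{vec}(X)} = (P^{T} X) \otimes I + (P^{T} \otimes X)\, K$$

$$\frac{\partial\, \mathrm{vec}(Y)}{\partial\, \mathrm{vec}(P)} = I \otimes (X X^{T})$$

Here $\mathrm{vec}(\cdot)$ means matrix vectorization, $\otimes$ represents the Kronecker product, $I$ is an identity matrix of the appropriate size, and $K$ is the commutation matrix, which satisfies $K\, \mathrm{vec}(X) = \mathrm{vec}(X^{T})$.

It can be seen that similarity matrices also appear in both derivatives. The term $X X^{T}$ in the derivative with respect to $P$ makes the gradients uniform and robust within a similar region defined by $X$ (semantic or instance), and thus helps optimization. The term $P^{T} X$ in the derivative with respect to $X$ computes similarities between the different feature channels of $X$ and $P$ rather than between points, and provides another crucial piece of information for extracting robust and useful gradients.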

In summary, the proposed Bi-Directional Attention module not only helps joint instance and semantic segmentation by transmitting and aggregating information between instance features and semantic features, but also benefits the back-propagation of uniform and robust gradients.
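The claim that the similarity matrix reappears in the gradient can be checked numerically on the simplified operation, in which the output is the inner-product similarity matrix of one feature multiplied by the other feature (no softmax, re-weighting, or concatenation). The sketch below uses small hypothetical matrices and plain Python: perturbing a single entry of the propagated feature changes one output column by the corresponding column of the similarity matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def simplified_attention(x, p):
    """Similarity matrix of x (inner products) applied to p."""
    return matmul(matmul(x, transpose(x)), p)

# Hypothetical features for three points.
x = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]   # features used to measure similarity
p = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # features being propagated
sim = matmul(x, transpose(x))

# Finite difference with respect to the entry p[k][l].
k, l, eps = 1, 0, 1e-3
p_shift = [row[:] for row in p]
p_shift[k][l] += eps
y = simplified_attention(x, p)
y_shift = simplified_attention(x, p_shift)

grad_col = [(y_shift[i][l] - y[i][l]) / eps for i in range(3)]
sim_col = [sim[i][k] for i in range(3)]
print(grad_col, sim_col)
```

Because the simplified operation is linear in the propagated feature, the finite difference matches the similarity column up to floating-point error, and the other output column is unaffected. During back-propagation, the gradient reaching a point is therefore a similarity-weighted mixture of the gradients at all points.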

4 Experiments

4.1 Experiments setting

4.1.1 Datasets

We study and evaluate our method on two widely used datasets. The Stanford 3D Indoor Semantics Dataset (S3DIS) [1] contains 3D scans of 6 areas comprising 271 rooms. Each scanned 3D point is associated with an instance label and a semantic label from 13 categories. PartNet [20] contains 573,585 fine-grained, annotated part instances and covers 24 object categories.

4.1.2 Evaluation Metrics

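For semantic segmentation, we compare our BAN with others by overall accuracy (oAcc), mean accuracy (mAcc), and mean IoU (mIoU). As for instance segmentation, coverage (Cov) and weighted coverage (WCov) [27, 18, 46] are adopted. Cov and WCov are defined as:

$$\mathrm{Cov}(\mathcal{G}, \mathcal{O}) = \sum_{i=1}^{|\mathcal{G}|} \frac{1}{|\mathcal{G}|} \max_{j} \mathrm{IoU}(r_i^{G}, r_j^{O})$$

$$\mathrm{WCov}(\mathcal{G}, \mathcal{O}) = \sum_{i=1}^{|\mathcal{G}|} w_i \max_{j} \mathrm{IoU}(r_i^{G}, r_j^{O}), \qquad w_i = \frac{|r_i^{G}|}{\sum_{k} |r_k^{G}|}$$

where the ground-truth regions are denoted as $\mathcal{G}$ and the predicted regions as $\mathcal{O}$, and $|r_i^{G}|$ is the number of points in ground-truth region $i$. Besides, the classical metrics mean precision (mPrec) and mean recall (mRec), computed at a fixed IoU threshold, are also reported.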
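The two coverage metrics can be computed directly from sets of point indices. The following is a minimal sketch, assuming each region is given as a set of point indices; the function names and the toy regions are hypothetical. Coverage averages, over ground-truth regions, the best IoU achieved by any predicted region; weighted coverage weights each ground-truth region by its share of the points.

```python
def iou(a, b):
    """Intersection over union of two sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def coverage(gt_regions, pred_regions):
    """Return (Cov, WCov) for lists of ground-truth and predicted regions."""
    best = [max((iou(g, o) for o in pred_regions), default=0.0) for g in gt_regions]
    cov = sum(best) / len(gt_regions)
    total = sum(len(g) for g in gt_regions)
    wcov = sum(len(g) / total * b for g, b in zip(gt_regions, best))
    return cov, wcov

# Toy example: six points, two ground-truth instances, one point misassigned.
gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {3, 4, 5}]
cov, wcov = coverage(gt, pred)
print(round(cov, 4), round(wcov, 4))
```

In the toy example the larger instance is matched with IoU 3/4 and the smaller one with IoU 2/3, so the weighted score leans toward the larger instance.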

4.1.3 Training and Testing Details

To optimize our Bi-Directional Attention Networks (BAN), we use the Adam optimizer [14] with a step learning-rate policy, in which the learning rate is divided by a fixed factor at regular iteration intervals. We train for a fixed number of epochs and use the default parameter setting in [5] for the discriminative loss. At test time, the instance embeddings are clustered by mean-shift with a fixed bandwidth. The BlockMerging algorithm proposed by SGPN [35] is used to merge instances from different blocks.

For S3DIS [1], we carry out training and testing with 6-fold cross-validation and split the rooms into overlapping blocks on the ground plane, each containing a fixed number of points, as in [24]. For PartNet [20], following [35], we train and test on each object category separately and report the mean of the metric values over all the objects.

| Method | Backbone | mCov | mWCov | mPrec | mRec |
|---|---|---|---|---|---|
| PointNet | PointNet | 43.0 | 46.3 | 50.6 | 39.2 |
| PointNet++ | PointNet++ | 49.6 | 53.4 | 62.7 | 45.8 |
| SGPN | PointNet | 37.9 | 40.8 | 38.2 | 31.2 |
| ASIS | PointNet++ | 51.2 | 55.1 | 63.6 | 47.5 |
| BoNet | PointNet++ | 46.0 | 50.2 | 65.6 | 47.6 |
| Ours | PointNet++ | 52.1 | 56.2 | 63.4 | 51.0 |

Table 1: Instance segmentation results on the S3DIS dataset (6-fold cross-validation).

| Method | Backbone | mAcc | mIoU | oAcc |
|---|---|---|---|---|
| PointNet | PointNet | 60.7 | 49.5 | 80.4 |
| PointNet++ | PointNet++ | 69.0 | 58.2 | 85.9 |
| ASIS | PointNet++ | 70.1 | 59.3 | 86.2 |
| Ours | PointNet++ | 71.7 | 60.8 | 87.0 |

Table 2: Semantic segmentation results on the S3DIS dataset (6-fold cross-validation).

4.2 S3DIS Results

In this section, we will compare our method (BAN) with other state-of-the-art methods, and the reported metric values are either from their papers or implemented and evaluated by ourselves when not available.

4.2.1 Instance segmentation

In Tab. 1, six methods are compared: PointNet [24], PointNet++ [26], SGPN [35], ASIS [37], BoNet [41], and our BAN. It is worth noting that PointNet++ has the same architecture and settings as ours except for the Bi-Directional Attention module, and thus can be treated as the baseline. PointNet is the same as PointNet++ except for the backbone. Our BAN outperforms the baseline (PointNet++) on all the metrics and achieves the best mCov, mWCov, and mRec among all compared methods, while BoNet and ASIS obtain slightly higher mPrec.

A more detailed comparison by WCov on each of the 13 categories is shown in Tab. 3. Ours achieves the highest score on most of the categories.

4.2.2 Semantic segmentation

Since SGPN [35] and BoNet [41] do not provide semantic segmentation results, we only compare with PointNet [24], PointNet++ [26], and ASIS [37] for semantic segmentation.

The evaluation results are shown in Tab. 2. In terms of mAcc, mIoU, and oAcc, our method consistently achieves the best performance. Evaluations by mIoU on all the semantic categories are listed in Tab. 3, and we obtain the best performance on most of them.

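| Category | WCov (ASIS) | WCov (BoNet) | WCov (Ours) | mIoU (ASIS) | mIoU (Ours) |
|---|---|---|---|---|---|
| mean | 54.7 | 50.2 | 56.2 | 59.7 | 60.8 |
| ceiling | 82.1 | 78.3 | 82.7 | 93.9 | 93.9 |
| floor | 78.3 | 70.5 | 76.8 | 95.7 | 94.2 |
| wall | 69.2 | 68.2 | 69.7 | 74.9 | 77.0 |
| beam | 40.0 | 38.3 | 44.4 | 36.1 | 38.0 |
| column | 18.4 | 15.4 | 20.3 | 30.0 | 32.6 |
| window | 57.9 | 55.3 | 60.9 | 53.4 | 54.9 |
| door | 59.1 | 56.2 | 58.4 | 63.3 | 64.5 |
| table | 58.0 | 51.9 | 59.2 | 63.0 | 65.8 |
| chair | 63.1 | 67.2 | 62.9 | 70.6 | 68.2 |
| sofa | 36.2 | 24.5 | 41.2 | 36.8 | 38.6 |
| bookcase | 44.3 | 36.7 | 44.3 | 50.1 | 52.2 |
| board | 54.5 | 42.3 | 56.2 | 49.9 | 52.3 |
| clutter | 50.4 | 47.6 | 51.9 | 58.2 | 58.2 |

Table 3: Per-class results on the S3DIS dataset.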

4.2.3 Visual Comparison

We show some visual results of the semantic and instance segmentation methods in Fig. 4. The results show that ours are more accurate and uniform than those of ASIS [37], especially for instance segmentation, as marked by the red circles. We believe this is due to the attention operations and the non-local information they introduce. Further studies of the attention mechanism are presented in Sec. 5.

Figure 4: Visual comparison of instance and semantic segmentation results on the S3DIS dataset. The first three columns are the instance segmentation results, while the last three columns show semantic segmentation results.

4.3 PartNet Results

In addition to object instance segmentation in indoor scenes, we further evaluate our method on part instance segmentation in objects using the PartNet dataset. This task is more fine-grained and thus requires more perception ability to understand the similarity between points.

The semantic and instance segmentation scores are listed in Tab. 4. The performance drops significantly compared with that on S3DIS. This is because the dataset contains many kinds of small semantic parts, which are difficult to perceive and predict, causing low semantic mIoU and instance mCov but relatively high semantic oAcc. For this kind of dataset with small semantic parts, ASIS [37] is difficult to adapt because its KNN relies on a fixed range-control parameter. In contrast, with the Bi-Directional Attention module, our method can compute the similarity between any two points and achieves better results.

The visual results of the PartNet dataset are shown in Fig. 5. Our method demonstrates obvious advantages compared with ASIS [37], and produces more accurate instance and semantic segmentation, especially for some small parts as marked by red circles.

| Method | Backbone | mCov | mIoU | oAcc |
|---|---|---|---|---|
| PointNet++ | PointNet++ | 42.0 | 43.4 | 78.4 |
| ASIS | PointNet++ | 39.3 | 40.2 | 76.7 |
| Ours | PointNet++ | 42.7 | 44.9 | 80.3 |

Table 4: Results on the PartNet dataset.

Figure 5: Visual comparison of instance and semantic segmentation results on the PartNet dataset. The first three columns are the instance segmentation results, while the last three columns show semantic segmentation results.

5 Discussion

In this section, we intend to show more evidence to justify the design and the mechanism of the proposed Bi-Directional Attention module.

5.1 Ablation study

As mentioned in Sec. 3.2.1, there are three kinds of sequences to conduct STOI and ITOS in our Bi-Directional Attention module, and we gave an assumption to decide our design. Here, we will verify our choice and further prove the necessity to have both STOI and ITOS.

In Tab. 5, we give results for instance and semantic segmentation with different combinations and orders of STOI and ITOS. The experiments are conducted on Area 5 of S3DIS [1]. Introducing STOI boosts instance segmentation. With ITOS, both instance and semantic segmentation show a certain improvement, which suggests that fusing instance features for semantic segmentation in our way is effective. The improvement is all the more significant considering the potential task conflict that arises with simple element-wise feature aggregation strategies such as adding and concatenating. Finally, with both STOI and ITOS, and STOI first, we achieve the best overall results. With the inverse order (ITOS first), however, instance segmentation performance drops sharply, with mPrec and mRec even falling below the results obtained without STOI and ITOS. This phenomenon confirms the importance of the order in which STOI and ITOS are conducted and is worth studying further in the future.

Further, we test the performance when the two inputs of Eq. 2 are the same feature, i.e., when each branch measures similarity on the very features it propagates, so that our Bi-Directional Attention module degrades to two independent self-attention operations [36]. The result is listed in the last row of Tab. 5. Without feature fusion, self-attention falls clearly behind our method on instance segmentation.

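| Configuration | mCov | mWCov | mPrec | mRec | mAcc | mIoU | oAcc |
|---|---|---|---|---|---|---|---|
| Neither STOI nor ITOS | 46.0 | 49.1 | 54.2 | 43.3 | 62.1 | 53.9 | 87.3 |
| STOI only | 47.1 | 50.1 | 55.3 | 43.6 | 61.2 | 53.4 | 87.0 |
| ITOS only | 47.4 | 50.3 | 54.0 | 43.4 | 62.0 | 54.7 | 87.8 |
| STOI and ITOS (STOI first) | 49.0 | 52.1 | 56.7 | 45.9 | 62.5 | 55.2 | 87.7 |
| Inverse order (ITOS first) | 46.3 | 49.4 | 53.5 | 41.5 | 62.5 | 55.1 | 87.9 |
| Self-attention | 45.4 | 48.6 | 53.3 | 43.6 | 62.5 | 55.1 | 87.9 |

Table 5: Results of all ablation experiments on Area 5 of S3DIS. mCov, mWCov, mPrec, and mRec measure instance segmentation; mAcc, mIoU, and oAcc measure semantic segmentation.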

5.2 Mechanism Study

Here, we visualize the learned instance and semantic similarity matrices defined in Eq. 2 to study and verify their mechanism. The similarity matrix is the key functional unit: it encodes the pair-wise similarities and is used to compute the weighted sum of non-local information. A good instance similarity matrix should accurately reflect the similarity relationship between all pairs of points, so it has one row and one column per point. When the instance (semantic) similarity matrix is trained well, it helps generate uniform and robust semantic (instance) features. Besides, good instance and semantic similarity matrices also benefit the back-propagation process, as stated in Sec. 3.2.3.

In Fig. 6, for the trained networks and each sample, we select the same row from the instance similarity matrix and the semantic similarity matrix, respectively, and then map the row vector back onto the 3D point cloud. The value at each point thus represents its similarity to the point corresponding to the selected row. For better visualization, we binarize the values to divide the points into two groups, similar points (green) and dissimilar points (blue), and mark the point corresponding to the selected row with a red circle. Each sample in Fig. 6 has two chairs in the scene. We can see that the semantic similarity matrix reflects the semantic similarities largely correctly, and the instance similarity matrix highlights most of the points in the same instance.
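The row-selection and binarization step can be sketched as follows. The binarization rule is not specified in the text, so the threshold used here (a weight above the uniform value 1/N counts as similar) is an assumption, as are the function name and the toy matrix.

```python
def similar_points(sim, query, threshold=None):
    """Binarize one row of an N x N row-normalized similarity matrix.

    Returns one boolean per point: True if the point is considered
    similar to the query point. Assumption: the default threshold is the
    uniform weight 1/N.
    """
    row = sim[query]
    if threshold is None:
        threshold = 1.0 / len(row)
    return [weight > threshold for weight in row]

# Toy similarity matrix for four points forming two groups.
sim = [
    [0.45, 0.45, 0.05, 0.05],
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
    [0.05, 0.05, 0.45, 0.45],
]
mask = similar_points(sim, query=0)
print(mask)
```

Coloring the points by this mask gives the kind of two-group picture described above, with the query point itself always falling in the similar group when its self-similarity is above the threshold.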

Figure 6: Visualization of instance and semantic similarity matrices. One row for each sample. From left to right: real scene blocks (each has two chairs), ground truth (instance), point cloud reflecting semantic similarity, and point cloud reflecting instance similarity.

6 Conclusion

We present Bi-Directional Attention Networks (BAN) for joint instance and semantic segmentation. Instead of fusing the features of the two tasks element-wise, our Bi-Directional Attention module builds instance and semantic similarity matrices from the instance and semantic features, respectively. With these matrices, two attention operations are conducted to aggregate features bi-directionally and implicitly, introducing non-local information and avoiding potential task conflict. Experiments on the S3DIS and PartNet datasets and our method analysis suggest that the Bi-Directional Attention module helps produce uniform and robust results within the same semantic or instance regions, and also helps back-propagate uniform and robust gradients for optimization. Our BAN consistently outperforms the baseline and compares favorably with other state-of-the-art works on the instance and semantic segmentation tasks. Moreover, the ablation and mechanism studies further verify the design and effectiveness of the Bi-Directional Attention module.


  • [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543. Cited by: §1, §4.1.1, §4.1.3, §5.1.
  • [2] Y. Cheng (1995) Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence 17 (8), pp. 790–799. Cited by: §3.2.1.
  • [3] J. Dai, K. He, Y. Li, S. Ren, and J. Sun (2016) Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pp. 534–549. Cited by: §1.
  • [4] J. Dai, K. He, and J. Sun (2016) Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150–3158. Cited by: §1.
  • [5] B. De Brabandere, D. Neven, and L. Van Gool (2017) Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551. Cited by: §1, §3.2.2, §4.1.3.
  • [6] C. Elich, F. Engelmann, J. Schult, T. Kontogianni, and B. Leibe (2019) 3D-bevis: birds-eye-view instance segmentation. arXiv preprint arXiv:1904.02199. Cited by: §2.2.
  • [7] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: §1.
  • [8] J. Guerry, A. Boulch, B. Le Saux, J. Moras, A. Plyer, and D. Filliat (2017) Snapnet-r: consistent 3d multi-view semantic labeling for robotics. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 669–678. Cited by: §1.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §2.1.
  • [10] J. Hou, A. Dai, and M. Nießner (2019) 3d-sis: 3d semantic instance segmentation of rgb-d scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4421–4430. Cited by: §1, §2.1.
  • [11] B. Hua, M. Tran, and S. Yeung (2018) Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 984–993. Cited by: §1.
  • [12] Q. Huang, W. Wang, and U. Neumann (2018) Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2626–2635. Cited by: §1.
  • [13] A. Ioannidou, E. Chatzilari, S. Nikolopoulos, and I. Kompatsiaris (2017) Deep learning advances in computer vision with 3d data: a survey. ACM Computing Surveys (CSUR) 50 (2), pp. 1–38. Cited by: §1.
  • [14] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations (ICLR). Cited by: §4.1.3.
  • [15] L. Landrieu and M. Simonovsky (2018) Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4558–4567. Cited by: §1.
  • [16] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) Pointcnn: convolution on x-transformed points. In Advances in neural information processing systems, pp. 820–830. Cited by: §1.
  • [17] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2359–2367. Cited by: §1.
  • [18] S. Liu, J. Jia, S. Fidler, and R. Urtasun (2017) Sgn: sequential grouping networks for instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3496–3504. Cited by: §4.1.2.
  • [19] D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §1.
  • [20] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su (2019) Partnet: a large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 909–918. Cited by: §1, §4.1.1, §4.1.3.
  • [21] A. Nguyen and B. Le (2013) 3D point cloud segmentation: a survey. In 2013 6th IEEE Conference on Robotics, Automation and Mechatronics (RAM), pp. 225–230. Cited by: §1.
  • [22] Q. Pham, T. Nguyen, B. Hua, G. Roig, and S. Yeung (2019) JSIS3D: joint semantic-instance segmentation of 3d point clouds with multi-task pointwise networks and multi-value conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8827–8836. Cited by: §1, §2.2, §3.2.2.
  • [23] P. O. Pinheiro, R. Collobert, and P. Dollár (2015) Learning to segment object candidates. In Advances in Neural Information Processing Systems, pp. 1990–1998. Cited by: §1.
  • [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §1, §4.1.3, §4.2.1, §4.2.2.
  • [25] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656. Cited by: §1.
  • [26] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pp. 5099–5108. Cited by: §1, §3.2.1, §4.2.1, §4.2.2.
  • [27] M. Ren and R. S. Zemel (2017) End-to-end instance segmentation with recurrent attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6656–6664. Cited by: §4.1.2.
  • [28] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari (2018) Fully-convolutional point networks for large-scale point clouds. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 596–611. Cited by: §1.
  • [29] G. Riegler, A. Osman Ulusoy, and A. Geiger (2017) Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586. Cited by: §1.
  • [30] B. Shi, S. Bai, Z. Zhou, and X. Bai (2015) Deeppano: deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters 22 (12), pp. 2339–2343. Cited by: §1.
  • [31] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953. Cited by: §1.
  • [32] D. Thanh Nguyen, B. Hua, K. Tran, Q. Pham, and S. Yeung (2016) A field model for repairing 3d shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5676–5684. Cited by: §1.
  • [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1.
  • [34] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong (2017) O-cnn: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–11. Cited by: §1.
  • [35] W. Wang, R. Yu, Q. Huang, and U. Neumann (2018) Sgpn: similarity group proposal network for 3d point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578. Cited by: §2.2, §4.1.3, §4.1.3, §4.2.1, §4.2.2.
  • [36] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §1, §3.1, §5.1.
  • [37] X. Wang, S. Liu, X. Shen, C. Shen, and J. Jia (2019) Associatively segmenting instances and semantics in point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4096–4105. Cited by: §1, §2.2, §3.2.2, §4.2.1, §4.2.2, §4.2.3, §4.3, §4.3.
  • [38] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 1–12. Cited by: §1.
  • [39] W. Wu, Z. Qi, and L. Fuxin (2019) Pointconv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: §1.
  • [40] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920. Cited by: §1.
  • [41] B. Yang, J. Wang, R. Clark, Q. Hu, S. Wang, A. Markham, and N. Trigoni (2019) Learning object bounding boxes for 3d instance segmentation on point clouds. In Advances in Neural Information Processing Systems, pp. 6737–6746. Cited by: §1, §2.1, §4.2.1, §4.2.2.
  • [42] X. Ye, J. Li, H. Huang, L. Du, and X. Zhang (2018) 3d recurrent neural networks with context fusion for point cloud semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 403–417. Cited by: §1.
  • [43] L. Yi, W. Zhao, H. Wang, M. Sung, and L. J. Guibas (2019) Gspn: generative shape proposal network for 3d instance segmentation in point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3947–3956. Cited by: §1, §2.1.
  • [44] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia (2018) Psanet: point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 267–283. Cited by: §1.
  • [45] L. Zhao and W. Tao (2020) JSNet: joint instance and semantic segmentation of 3d point clouds. In Thirty-Fourth AAAI Conference on Artificial Intelligence. Cited by: §1, §2.2, §2.2, §3.2.2.
  • [46] W. Zhuo, M. Salzmann, X. He, and M. Liu (2017) Indoor scene parsing with instance segmentation, semantic labeling and support relationship inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5429–5437. Cited by: §4.1.2.