ReDAL: Region-based and Diversity-aware Active Learning for Point Cloud Semantic Segmentation

07/25/2021 · Tsung-Han Wu et al.

Despite the success of deep learning on supervised point cloud semantic segmentation, obtaining large-scale point-by-point manual annotations is still a significant challenge. To reduce the huge annotation burden, we propose Region-based and Diversity-aware Active Learning (ReDAL), a general framework for many deep learning approaches, aiming to automatically select only informative and diverse sub-scene regions for label acquisition. Observing that only a small portion of annotated regions are sufficient for 3D scene understanding with deep learning, we use softmax entropy, color discontinuity, and structural complexity to measure the information of sub-scene regions. A diversity-aware selection algorithm is also developed to avoid redundant annotations resulting from selecting informative but similar regions in a querying batch. Extensive experiments show that our method greatly outperforms previous active learning strategies and achieves 90% of the performance of fully supervised learning while requiring less than 15% and 5% of the annotations on the S3DIS and SemanticKITTI datasets, respectively.


1 Introduction

Point cloud semantic segmentation is crucial for various emerging applications such as indoor robotics and autonomous driving. Many supervised approaches [19, 20, 30, 27, 6, 26], along with several large-scale datasets [1, 7, 10, 5], have recently been released and driven rapid progress.

Figure 1: Human labeling efforts (colored areas) of different learning strategies. (a) In supervised training or traditional deep active learning, all points in a single point cloud must be labeled, which is labor-intensive. (b) Since few regions contribute to model improvement, our region-based active learning strategy selects only a small portion of informative regions for label acquisition. Compared with case (a), our approach greatly reduces the cost of semantically labeling walls and floors. (c) Moreover, since visually similar regions repeated in the same querying batch cause redundant labeling, we develop a diversity-aware selection algorithm that further reduces this redundant effort (e.g., the ceiling colored in green in (b) and (c)) by penalizing visually similar regions.

Recent deep learning approaches have achieved great success with the aid of massive datasets. However, obtaining a large-scale point-by-point labeled dataset is still costly and challenging. Specifically, a room-sized point cloud scene typically contains more than 100,000 points [1, 5]. Furthermore, the annotation process for 3D point-wise data is much more complicated than that for 2D data. Unlike simply selecting closed polygons to form a semantic annotation in a 2D image [22], in 3D point-by-point labeling, annotators are asked to perform multiple 2D annotations from different viewpoints during the annotation process [10] or to label in 3D space with brushes while repeatedly zooming in and out and switching the brush size [5]. The sheer number of points and the complicated annotation process therefore significantly increase the time and cost of manual point-by-point labeling.

Figure 2: Not all annotated regions contribute to model improvement. This toy experiment compares the performance contribution of fully labeled (a) and partially labeled (b, w/o floor) scans on the S3DIS [1] dataset. Specifically, the training dataset contains only 4 fully labeled point cloud scans at the beginning. Another 4 fully or partially labeled scans are then added to the dataset at each following iteration. As shown in (c), removing the floor annotations from a scan results in comparable performance on all classes, including floor (blue), chairs (red), and bookcases (green), indicating that flat, large areas do not require abundant annotations. It is therefore crucial to select key regions that cut the annotation cost while preserving training performance. As shown in (d), 12% of the annotation could be cut with no drop in performance.

To alleviate the huge burden of manual point-by-point labeling in large-scale point cloud datasets, some previous works have tried to reduce the total number of labeled point cloud scans [14] or to lower the annotation density within a single scan [34]. However, they neglect that regions in a point cloud scan may not contribute equally to performance. As can be observed from Figure 2, a deep learning model needs only 2% of the labeled point cloud scans to reach high IoU on large uniform objects such as floors, whereas more than 15% of the annotated data are required to achieve comparable IoU on small items or objects with complex shapes and colors, like chairs and bookcases. Therefore, we argue that effective point selection is essential for lowering annotation costs while preserving model performance.

In this work, we propose a novel Region-based and Diversity-aware Active Learning (ReDAL) framework that is general to many deep learning network architectures. The framework actively selects data from a huge unlabeled dataset for label acquisition, so only a small portion of informative and diverse sub-scene regions needs to be labeled.

To find the most informative regions for label acquisition, we combine three terms, softmax entropy, color discontinuity, and structural complexity, to calculate the information score of each region. Softmax entropy is a widely used measure of model uncertainty, and areas with large color differences or complex structures in a point cloud provide more information because semantic labels are usually not smooth there. As can be observed from the comparison of Figure 1 (a, b), the region-based active selection strategy significantly reduces human annotation effort compared with full-scene labeling.

Furthermore, to avoid redundant annotation resulting from multiple individually informative but near-duplicate samples in a query batch, a common problem in deep active learning, we develop a novel diversity-aware selection algorithm that considers both region information and diversity. In our proposed method, we first extract all regions' features, then measure the similarity between regions in feature space, and finally use a greedy algorithm to penalize multiple similar regions appearing in the same querying batch. As can be observed from the comparison of Figure 1 (b, c), our region-based and diversity-aware selection strategy avoids querying labels for similar regions and further reduces manual labeling effort.

Experimental results demonstrate that our proposed method significantly outperforms existing deep active learning approaches on both indoor and outdoor datasets with various network architectures. On the S3DIS [1] and SemanticKITTI [5] datasets, our proposed method achieves 90% of the performance of fully supervised learning while requiring less than 15% and 5% of the annotations, respectively. Our ablation studies also verify the effectiveness of each component of our proposed method.

To sum up, our contributions are highlighted as follows,

  • We pave a new path for 3D deep active learning that utilizes region segmentation as the basic query unit.

  • We design a novel diversity-aware active selection approach to avoid redundant annotations effectively.

  • Experimental results show that our method greatly reduces human annotation effort across different state-of-the-art deep learning networks and datasets, and outperforms existing deep active learning methods.

2 Related Work

2.1 Point Cloud Semantic Segmentation with Less Labeled Data

In the past decade, many supervised point cloud semantic segmentation approaches have been proposed [13, 19, 20, 30, 27, 6, 3, 15, 26]. However, despite the continuous development of supervised learning algorithms and the ease of collecting 3D point cloud data of large scenes, the cost of obtaining manual point-by-point labels remains high. As a result, many researchers have begun to study how to achieve similar performance with less labeled data.

Some have tried to apply transfer learning to this task. Wu et al. [33] developed an unsupervised domain adaptation approach to make the model perform well in real-world scenarios given only synthetic training sets. However, their method can only be applied to a single network architecture [32] rather than serving as a general framework.

Others applied weakly supervised learning to reduce the cost of labeling. [34] utilized gradient approximation along with spatial and color smoothness constraints to train with few labeled scattered points. However, this operation does not save much cost: to label all scattered points, annotators still have to switch viewpoints and zoom in and out throughout a scene during the annotation process. Besides, [31] designed a multi-path region mining module to help a classification model learn local cues and generate pseudo point-wise labels at the sub-cloud level, but their performance is still far from the fully supervised state of the art.

Still others leverage active learning to alleviate the annotation burden. [16] designed an active learning approach to reduce the workload of semantic labeling required by a CRF model. However, their method cannot be applied to current large-scale datasets for two reasons. First, the algorithm relies heavily on the result of over-segmentation preprocessing, which cannot perfectly cut out small blocks with high purity in the increasingly complex scenes of current real-world datasets. Second, the computational cost of the pair-wise CRF is extremely high and thus unsuitable for large-scale datasets. In addition, different selection strategies have also been designed; for example, [14] proposed segment entropy to measure the informativeness of a point cloud scan in a deep active learning pipeline.

To the best of our knowledge, we are the first to design a region-based active learning framework general for many deep learning models. Furthermore, our idea of reducing redundant annotation through diversity-aware selection is totally different from those previous works.

2.2 Deep Active Learning

Sufficient labeled training data is vital for supervised deep learning models, but the cost of manual annotation is often high. Active learning [24] aims to reduce labeling cost by selecting the most valuable data for label acquisition. [28] proposed the first active learning framework for deep learning, where a batch of items is queried in each active selection step for acceleration, instead of a single item as in classical active learning.

Several past deep active learning practices are based on model uncertainty. [28] is the first work that applied least confidence, smallest margin [21], and maximum entropy [25] to deep active learning. [29] introduced semi-supervision to active learning, assigning pseudo-labels to instances with the highest certainty. [8, 9] combined Bayesian active learning with deep learning, estimating model uncertainty by MC-Dropout.

In addition to model uncertainty, many recent deep active learning works take in-batch data diversity into account. [23, 12, 2] stated that neglecting data correlation causes similar items to appear in the same querying batch, which leads to inefficient training. [23] converted batch selection into a core-set construction problem to ensure diversity in the labeled data; [12, 2] tried to consider model uncertainty and data diversity at the same time. Empirically, uncertainty and diversity are two key indicators in active learning. [11] is a hybrid method that enjoys the benefits of both by dynamically choosing the best query strategy in each selection.

To the best of our knowledge, we design the first 3D deep active learning framework combining uncertainty, diversity and point cloud domain knowledge.

3 Method

Figure 3: Region-based and Diversity-aware Active Learning Pipeline. In the proposed framework, a point cloud semantic segmentation model is first trained with supervision on the labeled dataset $\mathcal{D}_L$. The model then produces the softmax entropy and features of all regions in the unlabeled dataset $\mathcal{D}_U$. (a) Softmax entropy, along with the color discontinuity and structural complexity calculated from the unlabeled regions, serves as the selection indicators (Sec. 3.2), and (b) generates scores which are then adjusted by penalizing regions belonging to the same clusters grouped by the extracted features (Sec. 3.3). (c) The top-ranked regions are labeled by annotators and added to the labeled dataset $\mathcal{D}_L$ for the next phase (Sec. 3.4).

In this section, we describe our region-based and diversity-aware active learning pipeline in detail. Initially, we have a 3D point cloud dataset $\mathcal{D}$, which can be divided into two parts: a subset $\mathcal{D}_L$ containing randomly selected point cloud scans with complete annotations, and a large unlabeled set $\mathcal{D}_U$ without any annotation.

In traditional deep active learning, the network is first trained on the current labeled set $\mathcal{D}_L$ under supervision. Then, a batch of data is selected from the unlabeled set $\mathcal{D}_U$ for label acquisition according to a certain strategy. Finally, the newly labeled data are moved from $\mathcal{D}_U$ to $\mathcal{D}_L$, and the loop returns to step one to re-train or fine-tune the network, repeating until the annotation budget is exhausted.

3.1 Overview

We use a sub-scene region as the fundamental query unit in our proposed ReDAL method. In traditional deep active learning, the smallest unit for label querying is a sample, which is a whole point cloud scan in our task. However, based on the prior experiment shown in Figure 2, we know that some labeled regions may contribute little to the model improvement. Therefore, we change the fundamental unit of label querying from a point cloud scan to a sub-scene region in a scan.

Instead of using model uncertainty, the criterion common in 2D active learning, as the only selection signal, we leverage domain knowledge from 3D computer vision and include two informative cues, color discontinuity and structural complexity, in the selection indicators. Moreover, to avoid redundant labeling caused by multiple duplicate regions in a querying batch, we design a simple yet effective diversity-aware selection strategy to mitigate the problem and improve performance.

Our region-based and diversity-aware active learning can be divided into four main steps: (1) Train on the current labeled dataset $\mathcal{D}_L$ in a supervised manner. (2) Calculate the region information score for each region with three indicators: softmax entropy, structural complexity, and color discontinuity, as shown in Figure 3 (a) (Sec. 3.2). (3) Perform diversity-aware selection by measuring the similarity between all regions and using a greedy algorithm to penalize similar regions appearing in a querying batch, as shown in Figure 3 (b) (Sec. 3.3). (4) Select the top-ranked regions for label acquisition and move them from the unlabeled dataset $\mathcal{D}_U$ into the current labeled dataset $\mathcal{D}_L$, as shown in Figure 3 (c) (Sec. 3.4).

3.2 Region Information Estimation

We divide a large-scale point cloud scan into sub-scene regions, the fundamental label querying units, using the VCCS [17] algorithm, an unsupervised over-segmentation method that groups similar points into a region. The original purpose of this algorithm was to cut a point cloud into multiple small regions with high segmentation purity to reduce the computational burden of probabilistic models. Unlike that original high-purity requirement, our method merely uses the algorithm to divide a scan into medium-sized sub-scenes for better annotation and learning. An ideal sub-scene covers a few, but not too many, semantic classes while preserving the geometric structure of the point cloud.
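For reference, the sketch below shows only the interface such an over-segmentation exposes to the rest of the pipeline: points in, per-point region ids out. It substitutes a plain voxel-grid assignment for VCCS, so it is a toy stand-in under assumed parameters, not the algorithm of [17].

import numpy as np

def grid_regions(xyz, cell=0.5):
    """Toy stand-in for VCCS over-segmentation: assign each point to a
    coarse voxel-grid cell and use the cell index as the region id.
    The real algorithm [17] refines such seeds by spatial and color
    connectivity; this sketch only fixes the interface used below.
    xyz  -- (N, 3) point coordinates
    cell -- assumed cell size in meters
    Returns an (N,) array of region ids in [0, R)."""
    keys = np.floor(xyz / cell).astype(np.int64)
    _, region_ids = np.unique(keys, axis=0, return_inverse=True)
    return region_ids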

In each active selection step, we calculate the information score of a region from three aspects: (1) softmax entropy, (2) color discontinuity, and (3) structural complexity, which are described in detail as follows.

3.2.1 Softmax Entropy

Softmax entropy is a widely used approach to measure uncertainty in active learning [28, 29]. We first obtain the softmax probabilities of all point cloud scans in the unlabeled set with the model trained in the previous active learning phase. Then, given the softmax probability of a point cloud scan, we calculate the region entropy of the $i$-th region $r_i$ by averaging the entropy of the points belonging to the region, as shown in Eq. 1.

$H(r_i) = \frac{1}{|r_i|} \sum_{p \in r_i} \Big( -\sum_{c=1}^{C} P(y_p = c) \log P(y_p = c) \Big)$  (1)

where $C$ is the number of classes and $P(y_p = c)$ is the predicted probability that point $p$ belongs to class $c$.
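For concreteness, a minimal NumPy sketch of Eq. 1; the per-point softmax output and the region assignments from the over-segmentation are assumed to be given:

import numpy as np

def region_entropy(probs, region_ids, eps=1e-12):
    """Eq. 1: average per-point softmax entropy within each region.
    probs      -- (N, C) softmax probabilities over C classes
    region_ids -- (N,) region id of each point
    Returns an (R,) array of region entropy scores."""
    point_ent = -np.sum(probs * np.log(probs + eps), axis=1)       # (N,)
    R = region_ids.max() + 1
    sums = np.bincount(region_ids, weights=point_ent, minlength=R)
    counts = np.bincount(region_ids, minlength=R)
    return sums / np.maximum(counts, 1)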

3.2.2 Color Discontinuity

In 3D computer vision, color difference is also an important clue, since areas with large color differences are more likely to indicate semantic discontinuity. Therefore, it is also included as an indicator for measuring region information. For all points in a given point cloud with color intensity values $v$, we compute the norm of the color difference between each point and its $k$-nearest neighbors ($k$-NN). Then we produce the region color discontinuity score of the $i$-th region by averaging the values of the points belonging to the region, as shown in Eq. 2.

$D(r_i) = \frac{1}{|r_i|} \sum_{p \in r_i} \frac{1}{k} \sum_{q \in \mathcal{N}_k(p)} \| v_p - v_q \|$  (2)

where $\mathcal{N}_k(p)$ denotes the $k$-nearest neighbors of point $p$.
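A matching sketch of Eq. 2; the L2 norm and the neighborhood size k used here are assumptions, since the exact settings are deferred to Appendix A.2:

import numpy as np
from scipy.spatial import cKDTree

def region_color_discontinuity(xyz, colors, region_ids, k=10):
    """Eq. 2: per-region average of each point's mean color difference
    to its k nearest spatial neighbors (norm choice and k are assumed)."""
    _, nbr = cKDTree(xyz).query(xyz, k=k + 1)       # column 0 is the point itself
    diff = colors[nbr[:, 1:]] - colors[:, None, :]  # (N, k, 3) color differences
    point_score = np.linalg.norm(diff, axis=2).mean(axis=1)        # (N,)
    R = region_ids.max() + 1
    sums = np.bincount(region_ids, weights=point_score, minlength=R)
    counts = np.bincount(region_ids, minlength=R)
    return sums / np.maximum(counts, 1)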

3.2.3 Structural Complexity

We also include structural complexity as an indicator, since complex surface regions, boundary areas, and corners in a point cloud are more likely to indicate semantic discontinuity. For all points in a given point cloud, we first compute the surface variation based on [4, 18]. Then, we calculate the region structural complexity score of the $i$-th region by averaging the surface variation of the points belonging to the region, as shown in Eq. 3.

$S(r_i) = \frac{1}{|r_i|} \sum_{p \in r_i} \sigma(p)$,  with  $\sigma(p) = \frac{\lambda_0}{\lambda_0 + \lambda_1 + \lambda_2}$  (3)

where $\lambda_0 \le \lambda_1 \le \lambda_2$ are the eigenvalues of the covariance matrix of the local neighborhood of point $p$ [4, 18].
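A sketch of Eq. 3 under the same assumptions, taking surface variation as the smallest-eigenvalue ratio of each point's local covariance matrix, following [4, 18]:

import numpy as np
from scipy.spatial import cKDTree

def region_surface_variation(xyz, region_ids, k=10):
    """Eq. 3: surface variation lambda_0 / (lambda_0+lambda_1+lambda_2)
    of the local covariance eigenvalues [4, 18], averaged per region.
    k is an assumed neighborhood size."""
    _, nbr = cKDTree(xyz).query(xyz, k=k + 1)
    local = xyz[nbr]                                   # (N, k+1, 3) neighborhoods
    centered = local - local.mean(axis=1, keepdims=True)
    cov = np.einsum('nkd,nke->nde', centered, centered) / (k + 1)
    lam = np.linalg.eigvalsh(cov)                      # ascending eigenvalues, (N, 3)
    point_var = lam[:, 0] / np.maximum(lam.sum(axis=1), 1e-12)
    R = region_ids.max() + 1
    sums = np.bincount(region_ids, weights=point_var, minlength=R)
    counts = np.bincount(region_ids, minlength=R)
    return sums / np.maximum(counts, 1)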

After calculating the softmax entropy, color discontinuity, and structural complexity of each region, we form the region information score of the $i$-th region as a linear combination of the three terms, as shown in Eq. 4.

$I(r_i) = \alpha H(r_i) + \beta D(r_i) + \gamma S(r_i)$  (4)

where $\alpha$, $\beta$, and $\gamma$ are combination weights (see Appendix A.2).

Finally, we rank all regions in descending order of their region information scores and produce a sorted information list $\mathcal{L}$. The above process is illustrated in Figure 3 (a).

3.3 Diversity-aware Selection

With the sorted region information list $\mathcal{L}$, a naive strategy is to directly select the top-ranked regions for label acquisition. Nevertheless, this results in multiple visually similar regions landing in the same batch, as shown in Figure 4. These regions, though informative individually, provide less diverse information to the model.

Figure 4: Our method is able to find visually similar regions not only in the same point cloud (a) but also across different point clouds (b). The areas colored in red are the ceiling of an auditorium (a) and the walls next to the door (b). These regions would cause redundant labeling effort if they appeared in the same querying batch, and thus they are filtered by our diversity-aware selection (Sec. 3.3.1).

To avoid visually similar regions appearing in a querying batch, we design a diversity-aware selection algorithm divided into two parts: (1) region similarity measurement and (2) similar region penalization.

3.3.1 Region Similarity Measurement

We measure the similarity among regions in the feature space rather than directly on point cloud data because the scale, shape, and color of each region are totally different.

Given a point cloud scan with $N$ points, we record the output before the final classification layer as the point features, with shape $N \times D$. Then, we produce the region features by averaging the point features of the points belonging to the same region. Finally, we gather the regions of all point clouds and use the $K$-means algorithm to cluster their region features. The above process is shown in the middle of Figure 3 (b).

After clustering regions, we regard regions belonging to the same cluster as similar regions. An example is shown in Figure 4.
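A minimal sketch of this step; the feature dimension comes from the backbone, and num_clusters is a placeholder value (the actual cluster counts are set per dataset, see Appendix A.3):

import numpy as np
from sklearn.cluster import KMeans

def cluster_region_features(point_feats, region_ids, num_clusters=50):
    """Sec. 3.3.1: average (N, D) point features into (R, D) region
    features, then group similar regions with K-means.
    num_clusters=50 is an assumed placeholder value."""
    R, D = region_ids.max() + 1, point_feats.shape[1]
    region_feats = np.zeros((R, D))
    np.add.at(region_feats, region_ids, point_feats)   # sum features per region
    counts = np.bincount(region_ids, minlength=R)
    region_feats /= np.maximum(counts, 1)[:, None]
    return KMeans(n_clusters=num_clusters).fit_predict(region_feats)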

3.3.2 Similar Region Penalization

To select diverse regions, a greedy algorithm takes the sorted list of information scores as input and re-scores all regions, penalizing lower-ranked regions that belong to the same clusters as higher-ranked ones.

The table on the right of Figure 3 (b) offers an example where the algorithm loops through all regions one by one. In each iteration, the scores of regions ranked below the current region that belong to its cluster are multiplied by a decay rate $d$. Specifically, the red, green, and blue dots on the left of the table denote the cluster indices of the regions, and a superscript on a score marks how many times it has been penalized. The yellow circle indicates the current region in each iteration, and rounded rectangles mark the regions belonging to the same cluster as the current region. In the first iteration, the lower-ranked region sharing the top region's cluster (blue dots) is penalized, and its score is multiplied by $d$ to mark the first decay. In the third iteration, two lower-ranked regions share the current region's cluster (green dots), so both of their scores are decayed. The same logic applies to the remaining iterations, after which we obtain the adjusted scores for label acquisition.

Note that in our implementation, shown in Algorithm 1, we penalize the corresponding cluster importance weights $w$, which are initialized to 1 for all clusters, instead of directly penalizing the scores, for efficiency. Precisely, in each iteration, we adjust the score of the current region by multiplying it by the importance weight of its cluster; the importance weight of that cluster is then multiplied by the decay rate $d$.

Input: Sorted information scores $s_1, \dots, s_N$ with corresponding cluster labels $y_1, \dots, y_N$ over $K$ clusters; cluster importance weights $w_1, \dots, w_K$, each initialized to 1; decay rate $d$
Output: Final region information scores $s'_1, \dots, s'_N$
for $i \leftarrow 1$ to $N$ do
       $s'_i \leftarrow s_i \cdot w_{y_i}$;
       $w_{y_i} \leftarrow w_{y_i} \cdot d$;
end for
return $s'$
Algorithm 1 Similar Region Penalization
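In Python, the algorithm is a direct loop over the regions in descending score order; the decay rate value below is an assumption:

import numpy as np

def penalize_similar_regions(scores, cluster_ids, d=0.95):
    """Algorithm 1: multiply each region's score by its cluster's
    importance weight, then decay that weight, so later (lower-ranked)
    regions of the same cluster are progressively penalized.
    d=0.95 is an assumed decay rate."""
    weights = np.ones(cluster_ids.max() + 1)   # importance weight per cluster, init 1
    adjusted = np.empty_like(scores, dtype=float)
    for i in np.argsort(-scores):              # regions in descending score order
        adjusted[i] = scores[i] * weights[cluster_ids[i]]
        weights[cluster_ids[i]] *= d
    return adjusted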

3.4 Region Label Acquisition

Once we obtain the final information scores $s'$, which account for region diversity, we select regions into a querying batch in decreasing order of $s'$ until the budget of this round is exhausted. Note that in each label acquisition step, we set the budget as a fixed number of total points rather than a fixed number of regions for fair comparison, since each region contains a different number of points.
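A minimal sketch of this budget-constrained selection, where region sizes are the point counts of each region:

import numpy as np

def select_within_budget(adjusted_scores, region_sizes, budget_points):
    """Sec. 3.4: take regions in decreasing adjusted-score order until
    the fixed point budget of this round is exhausted."""
    chosen, spent = [], 0
    for i in np.argsort(-adjusted_scores):
        chosen.append(int(i))
        spent += int(region_sizes[i])
        if spent >= budget_points:             # budget exhausted, stop querying
            break
    return chosen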

For experiments, after selecting the querying batch, we regard the ground-truth region annotation as the labeled data obtained from human annotators. These regions are then moved from the unlabeled set $\mathcal{D}_U$ to the labeled set $\mathcal{D}_L$. Note that, unlike the 100% fully labeled initial training scans, since we regard a region as the basic labeling unit, many point cloud scans with only a small portion of labeled regions are appended to the labeled dataset in every active selection step, as shown in Figure 3 (c).

After finishing the active selection step, which consists of region information estimation, diversity-aware selection, and region label acquisition, we repeat the active learning loop, fine-tuning the network on the updated labeled dataset $\mathcal{D}_L$.

4 Experiments

Figure 5: Experimental results of different active learning strategies on 2 datasets and 2 network architectures. We compare our region-based and diversity-aware active selection strategy with existing baselines. Our proposed method clearly outperforms every existing active selection approach under every combination. Furthermore, our method reaches 90% of the fully supervised result with only 15% and 5% labeled points on the S3DIS [1] and SemanticKITTI [5] datasets, respectively.

4.1 Experimental Settings

In order to verify the effectiveness and universality of our proposed active selection strategy, we conduct experiments on two different large-scale datasets and two different network architectures. The implementation details are explained in the supplementary material due to limited space.

Datasets.

We use S3DIS [1] and SemanticKITTI [5] as representatives of indoor and outdoor scenes, respectively. S3DIS is a commonly used indoor scene segmentation dataset. It is divided into 6 large areas with a total of 271 rooms, each with a dense point cloud carrying color and position information. We evaluate the performance of all label acquisition strategies on the Area 5 validation set and perform active learning on the remaining areas. SemanticKITTI is a large-scale autonomous driving dataset with 43,552 point cloud scans from 22 sequences. Each scan is captured by a LiDAR sensor and carries only position information. We evaluate the performance of all label acquisition strategies on the official validation split (seq 08) and perform active learning on the whole official training split (seq 00-07 and 09-10).

Network Architectures.

To verify that the strategy is applicable to various deep learning networks, we use MinkowskiNet [6], based on sparse convolution, and SPVCNN [26], based on point-voxel CNN; both offer strong performance on large-scale point cloud datasets with high inference speed.

Active Learning Protocol.

For all experiments, we first randomly select a small portion ($x$%) of fully labeled point cloud scans from the whole training data as the initial labeled dataset $\mathcal{D}_L$ and treat the rest as the unlabeled set $\mathcal{D}_U$. Then, we perform $K$ rounds of the following actions: (1) Train the deep learning model on $\mathcal{D}_L$ in a supervised manner. (2) Select a small portion ($y$%) of data from $\mathcal{D}_U$ for label acquisition according to different active selection strategies. (3) Add the newly labeled data into $\mathcal{D}_L$ and fine-tune the deep learning model.

We choose $x = 3$, $y = 2$, and $K = 6$ for the S3DIS dataset, and $x = 1$, $y = 1$, and $K = 4$ for the SemanticKITTI dataset, matching the schedules in Appendix A.1. To ensure the reliability of the experimental results, we run each experiment 3 times and report the average value for each experimental setting.

4.2 Comparison among different active selection strategies.

We compare our proposed method with 7 other active selection strategies: random point cloud scan selection (RAND), softmax confidence (CONF) [28], softmax margin (MAR) [28], softmax entropy (ENT) [28], MC-Dropout (MCDR) [8, 9], the core-set approach (CSET) [23], and segment entropy (SEGENT) [14]. The implementation details of these baselines are explained in the supplementary material.

The experimental results can be seen in Figure 5. In each subplot, the x-axis shows the percentage of labeled points, and the y-axis shows the mIoU achieved by the network architecture. Our proposed ReDAL significantly surpasses the other existing active learning strategies under every combination.

In addition, we observe that random selection (RAND) outperforms all other active learning methods except ours in four experiments. For uncertainty-based methods, such as ENT and MCDR, the model uncertainty value is dominated by the background area, so the performance is not as expected. Likewise, for the pure diversity approach CSET, the global feature is dominated by the background area, so simply clustering global features cannot produce diverse label acquisition. The experimental results further verify that changing the fundamental querying unit from a scan to a region is a better choice.

On the S3DIS [1] dataset, our proposed active selection strategy achieves more than 55% mIoU with 15% labeled points, while the others cannot reach 50% mIoU under the same condition. The main reason for such a large performance gap is that the room-sized point cloud scans in the S3DIS dataset differ greatly from one another. Compared with other active selection methods that query a batch of whole point cloud scans, our region-based label acquisition allows the model to be trained on more diverse labeled data.

As for SemanticKITTI [5], we find that with less than 5% of labeled data, our active learning strategy achieves 90% of the fully supervised result. With the MinkowskiNet architecture, our strategy even reaches 95% of the fully supervised result with only 4% labeled points.

Furthermore, in Table 1, the performance on some small or complicated object classes, like bicycle and bicyclist, is even better than the fully supervised one. Table 2 shows the reason: our selection algorithm focuses more on those small or complicated objects. In other words, our ReDAL does not waste annotation budget on easy cases like uniform surfaces, which echoes the observation and motivation in the introduction. Moreover, this emphasis on more important and valuable semantics makes the selection strategy well suited to real-world applications such as autonomous driving.

Method avg road person bicycle bicyclist
RAND 54.7 90.2 52.0 9.5 47.7
Full 61.4 93.5 65.0 20.3 78.4
ReDAL 59.8 91.5 63.4 29.5 84.1
Table 1: Results of IoU performance (%) on SemanticKITTI [5]. With only 5% of annotated points, our proposed ReDAL outperforms random selection and is on par with full supervision (Full).
Method road person bicycle bicyclist
RAND 206 0.42 0.15 0.10
Full 205 0.35 0.17 0.13
ReDAL 168 1.20 0.25 0.21
Table 2: Labeled Class Distribution Ratio (‰). With limited annotation budgets, our active method ReDAL queries more labels on small objects like person but fewer on large uniform areas like road. The selection strategy mitigates the label imbalance problem and improves performance on more complicated objects without hurting large areas much, as shown in Table 1.
Figure 6: Visualization of the inference results on the S3DIS dataset with the SPVCNN network architecture. We show inference examples on the S3DIS Area 5 validation set. With our active learning strategy, the model is able to produce sharp boundaries (yellow bounding box in the first row) and recognize small objects, such as boards and chairs (yellow bounding box in the second row), with only 15% labeled points.
Figure 7: Visualization of the inference results on the SemanticKITTI dataset with the MinkowskiNet network architecture. We show inference examples on the SemanticKITTI sequence 08 validation set. With our active learning strategy, the model is able to correctly recognize a small vehicle (red bounding box in the first row) and identify a person on the sidewalk (red bounding box in the second row) with merely 5% labeled points.
Figure 8: Ablation Study. The best combinations are altering labeling units from scans to regions (+Region), applying diversity-aware selection (+Div), and additional region information (+Color/Structure). Best viewed in color. (Sec. 4.3)

4.3 Ablation Studies

We verify the effectiveness of all components in our proposed active selection strategy on S3DIS dataset [1].

First, changing the labeling unit from scans to regions contributes the most to model improvement, as shown by the comparison of the purple line (ENT), the yellow line (ENT+Region), and the light blue line (RAND+Region) in Figure 8. After applying region-based selection, the mIoU performance improves by more than 10% under both network architectures.

Furthermore, our diversity-aware selection also plays a key role in the active selection process, as shown by the comparison of the yellow line (ENT+Region) and the green line (ENT+Region+Div) in Figure 8. Without this component, region-based entropy performs worse than random region selection under the SPVCNN architecture, as shown by the comparison of the yellow line (ENT+Region) and the light blue line (RAND+Region).

As for the extra information of color discontinuity and structural complexity, it contributes little to SPVCNN but helps MinkowskiNet when the percentage of labeled points exceeds 9%, as shown by the comparison of the green line (w/o color and structure) and the red line (w/ color and structure).

Note that as can be seen in Figure 8, the performance of “ENT+Region+Color/Structure” (the dark blue line) is similar to “ENT+Region” (the yellow line). The reason is that without our diversity module, the selected query batch is still full of duplicated regions. This result further validates the importance of our diversity-aware greedy algorithm.

5 Conclusion

We propose ReDAL, a region-based and diversity-aware active learning framework for point cloud semantic segmentation. The active selection strategy considers both region information and diversity, concentrating the labeling effort on the most informative and distinctive regions rather than full scenes. The approach can be applied to many deep learning network architectures and datasets, substantially reduces the cost of annotation, and greatly outperforms existing active learning strategies.

Acknowledgement

This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 110-2634-F-002-026, Mobile Drive Technology (FIH Mobile Limited), and Industrial Technology Research Institute (ITRI). We benefit from NVIDIA DGX-1 AI Supercomputer and are grateful to the National Center for High-performance Computing.

References

  • [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1543.
  • [2] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal (2020) Deep batch active learning by diverse, uncertain gradient lower bounds. In ICLR.
  • [3] M. Atzmon, H. Maron, and Y. Lipman (2018) Point convolutional neural networks by extension operators. ACM Trans. Graph. 37 (4).
  • [4] D. Bazazian, J. R. Casas, and J. Ruiz-Hidalgo (2015) Fast and robust edge extraction in unorganized point clouds. In International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8.
  • [5] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9297–9307.
  • [6] C. Choy, J. Gwak, and S. Savarese (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084.
  • [7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839.
  • [8] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
  • [9] Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep Bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192.
  • [10] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys (2017) Semantic3D.net: a new large-scale point cloud classification benchmark. arXiv preprint arXiv:1704.03847.
  • [11] W. Hsu and H. Lin (2015) Active learning by learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • [12] A. Kirsch, J. van Amersfoort, and Y. Gal (2019) BatchBALD: efficient and diverse batch acquisition for deep Bayesian active learning. In Advances in Neural Information Processing Systems, pp. 7026–7037.
  • [13] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg (2017) Deep projective 3D semantic segmentation. In International Conference on Computer Analysis of Images and Patterns, pp. 95–107.
  • [14] Y. Lin, G. Vosselman, Y. Cao, and M. Yang (2020) Efficient training of semantic point cloud segmentation via active learning. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2, pp. 243–250.
  • [15] Z. Liu, H. Tang, Y. Lin, and S. Han (2019) Point-voxel CNN for efficient 3D deep learning. In Advances in Neural Information Processing Systems, pp. 965–975.
  • [16] H. Luo, C. Wang, C. Wen, Z. Chen, D. Zai, Y. Yu, and J. Li (2018) Semantic labeling of mobile LiDAR point clouds via active learning and higher order MRF. IEEE Transactions on Geoscience and Remote Sensing 56 (7), pp. 3631–3644.
  • [17] J. Papon, A. Abramov, M. Schoeler, and F. Worgotter (2013) Voxel cloud connectivity segmentation: supervoxels for point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2027–2034.
  • [18] M. Pauly, R. Keiser, and M. Gross (2003) Multi-scale feature extraction on point-sampled surfaces. In Computer Graphics Forum, Vol. 22, pp. 281–289.
  • [19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
  • [20] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30, pp. 5099–5108.
  • [21] N. Roy and A. McCallum (2001) Toward optimal active learning through Monte Carlo estimation of error reduction. ICML, Williamstown, pp. 441–448.
  • [22] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman (2008) LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision 77 (1-3), pp. 157–173.
  • [23] O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations.
  • [24] B. Settles (2009) Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
  • [25] C. E. Shannon (1948) A mathematical theory of communication. The Bell System Technical Journal 27 (3), pp. 379–423.
  • [26] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han (2020) Searching efficient 3D architectures with sparse point-voxel convolution. In European Conference on Computer Vision, pp. 685–702.
  • [27] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) KPConv: flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6411–6420.
  • [28] D. Wang and Y. Shang (2014) A new active labeling method for deep learning. In International Joint Conference on Neural Networks (IJCNN), pp. 112–119.
  • [29] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin (2016) Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology 27 (12), pp. 2591–2600.
  • [30] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan (2019) Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10296–10305.
  • [31] J. Wei, G. Lin, K. Yap, T. Hung, and L. Xie (2020) Multi-path region mining for weakly supervised 3D semantic segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4384–4393.
  • [32] B. Wu, A. Wan, X. Yue, and K. Keutzer (2018) SqueezeSeg: convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893.
  • [33] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In International Conference on Robotics and Automation (ICRA), pp. 4376–4382.
  • [34] X. Xu and G. H. Lee (2020) Weakly supervised semantic point cloud segmentation: towards 10x fewer labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13706–13715.

Appendix A Implementation Details

As explained in the main paper, the pipeline of our ReDAL contains four steps: (1) Train the deep learning model with supervision on the labeled dataset $\mathcal{D}_L$. (2) Calculate the region information scores using softmax entropy, color discontinuity, and structural complexity. (3) Perform diversity-aware selection by penalizing visually similar regions appearing in the same querying batch. (4) The top-ranked regions are labeled by annotators and added to the labeled dataset $\mathcal{D}_L$. This section explains the implementation details of the first three steps; the fourth step has been explained in the main paper. Note that the following symbols are the same as those in Section 3 of the main paper.

A.1 Network Training

For both the S3DIS [1] and SemanticKITTI [5] datasets, the networks are trained with the Adam optimizer and cross-entropy loss. We train the networks on 8 V100 GPUs with a batch size of 16 and set the voxel resolution to 5cm for both datasets.

On the S3DIS dataset, the deep learning model is trained for 200 epochs on the initial 3% of fully labeled point cloud scans and then fine-tuned for 150 epochs after each addition of 2% labeled data, for both network backbones. On the SemanticKITTI dataset, the model is trained for 100 epochs on the initial 1% of fully labeled scans and then fine-tuned for 30 epochs after each addition of 1% labeled data, for both network backbones.

A.2 Region Information Estimation

We utilize the VCCS algorithm [17] to divide a 3D scene into multiple sub-scene regions. In the algorithm, the whole 3D space is initially divided into multiple regions governed by two hyper-parameters: the seed resolution, which indicates the initial distance between regions, and the voxel resolution, which represents the minimal region resolution. After that, the clustering procedure iteratively adjusts the region boundaries based on spatial or color connectivity. For the S3DIS dataset, we set the seed resolution to a small value since objects in an indoor scene are small. On the other hand, we set it to a large value for the SemanticKITTI dataset because the point cloud is quite sparse in such a large outdoor 3D space, and larger parameters avoid creating small, unrepresentative regions. We show an example of divided sub-scene regions in the SemanticKITTI dataset in Figure 9.

Figure 9: Visualization of divided sub-scene regions in SemanticKITTI dataset. Points of the same color in neighboring places belong to the same region.

As mentioned in Section 3.2 of the main paper, we linearly combine softmax entropy, color discontinuity, and structural complexity into the region information score. For color discontinuity and structural complexity, we calculate color differences and surface variation between each point and its $k$-nearest neighbors, with the same $k$ for both datasets. The weights of the linear combination, described in Equation 4 of the main paper, are set separately for the S3DIS and SemanticKITTI datasets. They are decided empirically: we found that model uncertainty matters much more than the color discontinuity and structural complexity terms. In addition, since the SemanticKITTI dataset does not have point-by-point color information, we set the color weight $\beta = 0$ for that dataset.

A.3 Diversity-aware Selection

As explained in Section 3.3 of the main paper, we measure the similarity of regions by clustering their corresponding region features. The number of clusters is set separately for the S3DIS and SemanticKITTI datasets, and the same decay rate $d$ is used for both. Note that our diversity-aware selection algorithm does not create much computational burden: on the SemanticKITTI dataset, it takes only 0.58 ms per region on average. We also empirically found that the experimental results are not sensitive to $k$ in $k$-NN, the decay rate, or the number of clusters; all values are determined via grid search.

Appendix B Baseline Active Learning Methods

In this section, we describe the implementation of the baseline active learning methods used in our experiments.

Random selection (RAND)

Randomly select a portion of point cloud scans from the unlabeled dataset for label acquisition. This strategy is commonly used as a baseline for active learning methods [28, 9, 23, 14].

Margin sampling (MAR)

Some previous active learning methods query instances with the smallest model decision margin, i.e., the difference between the predicted probabilities of the two most likely class labels [28]. As shown in Eq. 5, given a point cloud scan $x$ with $N$ points and fixed model parameters $\theta$, we calculate the negated difference between the two most likely class labels for all points and produce the score of a point cloud scan by averaging the values of all its points. After that, we select the point cloud scans with the largest scores from the unlabeled dataset for label acquisition.

$S_{\mathrm{MAR}}(x) = -\frac{1}{N} \sum_{i=1}^{N} \left[ P(y_i = c_i^{(1)} \mid x; \theta) - P(y_i = c_i^{(2)} \mid x; \theta) \right]$  (5)

where $c_i^{(1)}$ is the most probable label class and $c_i^{(2)}$ is the second most probable label class for point $i$.

Least confidence sampling (CONF)

Many previous active learning methods query the samples whose predictions are the least confident [28, 29]. As shown in Eq. 6, given a point cloud scan $x$ with $N$ points and fixed model parameters $\theta$, we calculate the confidence of the predicted class label for every point and produce the score of a point cloud scan by averaging the values of all its points. After that, we select the point cloud scans with the lowest confidence scores from the unlabeled dataset for label acquisition.

$S_{\mathrm{CONF}}(x) = \frac{1}{N} \sum_{i=1}^{N} \max_{c} P(y_i = c \mid x; \theta)$  (6)
Softmax entropy (ENT)

Entropy measures the information content of a probability distribution in information theory [25]. Some previous active learning approaches query samples with the highest entropy in the predicted probability [28]. As shown in Eq. 7, given a point cloud scan $x$ with $N$ points and fixed model parameters $\theta$, we calculate the softmax entropy of every point and produce the score of a point cloud scan by averaging the values of all its points. After that, we select the point cloud scans with the largest entropy from the unlabeled dataset for label acquisition.

$S_{\mathrm{ENT}}(x) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} P(y_i = c \mid x; \theta) \log P(y_i = c \mid x; \theta)$  (7)

where $C$ represents the total number of label classes, and $P(y_i = c \mid x; \theta)$ represents the probability that the model predicts point $i$ as class $c$.
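For reference, a minimal NumPy sketch of the three scan-level scores (Eqs. 5-7); margin and confidence are negated here so that a larger value always means higher query priority, matching the selection rules above:

import numpy as np

def scan_uncertainty_scores(probs, eps=1e-12):
    """Baseline scan scores from the (N, C) softmax output of one scan:
    MAR (Eq. 5), CONF (the negative of Eq. 6), and ENT (Eq. 7)."""
    top2 = np.sort(probs, axis=1)[:, -2:]      # two most likely class probabilities
    margin = -(top2[:, 1] - top2[:, 0]).mean()                      # Eq. 5
    confidence = -top2[:, 1].mean()                                 # negative of Eq. 6
    entropy = -np.sum(probs * np.log(probs + eps), axis=1).mean()   # Eq. 7
    return margin, confidence, entropy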

Core-Set (CSET)

Sener and Savarese [23] proposed a purely diversity-based deep active selection strategy named Core-Set. The strategy aims to select a small subset such that a model trained on the selected subset performs similarly to one trained on the whole dataset. The method first extracts the feature of each sample. Then, it selects a small number of samples from the unlabeled dataset that are the furthest away from the labeled dataset in the feature space for label acquisition. In our implementation, we take the middle layer of the encoder-decoder network as the feature.

Segment entropy (SEGENT)

Lin et al. [14] proposed segment entropy to measure the informativeness of a point cloud scan in a deep active learning pipeline. The method assumes that each geometrically related area should share similar semantic annotations and therefore calculates the entropy of the distribution of predicted labels in a small area to estimate model uncertainty.

MC-Dropout (MCDR)

[8, 9] combined Bayesian active learning with deep learning, estimating model uncertainty by Monte Carlo Dropout. In our implementation, we set the dropout rate to 0.3 and perform 10 dropout predictions. Note that since there is no dropout layer in MinkowskiNet [6], we do not compare against this baseline when using MinkowskiNet.

Appendix C Experimental Results

% Labeled Data RAND MAR CONF ENT CSET SEGENT MCDR ReDAL (Ours)
init. 27.05 28.29 28.60 27.92 28.89 29.16 28.33 27.86
5 31.39 30.07 32.14 31.02 33.24 34.55 29.30 41.27
7 35.37 31.34 33.76 35.10 36.59 40.97 33.68 47.68
9 40.51 33.30 38.57 40.90 37.02 42.30 40.00 52.34
11 44.50 39.75 40.60 41.51 41.42 43.07 41.65 54.28
13 46.28 40.41 42.43 43.42 41.34 44.48 44.04 57.01
15 49.02 40.45 44.44 45.06 41.40 45.04 45.06 57.97
Table 3: Results of IoU performance (%) on S3DIS [1] with SPVCNN [26].
% Labeled Data RAND MAR CONF ENT CSET SEGENT ReDAL (Ours)
init. 26.59 25.20 25.52 26.60 25.60 26.30 25.63
5 30.22 25.87 27.81 27.60 35.58 26.66 39.45
7 34.76 32.40 30.25 28.91 38.88 30.45 44.29
9 38.79 36.20 32.23 35.40 40.41 39.72 50.50
11 43.80 41.31 38.39 37.10 41.28 41.95 55.11
13 46.13 42.28 42.10 37.42 43.63 44.66 56.14
15 48.57 43.15 42.18 40.37 47.26 45.79 57.26
Table 4: Results of IoU performance (%) on S3DIS [1] with MinkowskiNet [6].
% Labeled Data RAND MAR CONF ENT CSET SEGENT MCDR ReDAL (Ours)
init. 41.84 42.39 42.98 41.90 42.19 43.18 42.92 41.87
2 45.41 46.84 46.31 45.57 46.98 47.89 47.57 51.70
3 52.19 49.55 50.15 51.42 52.93 52.60 50.08 55.83
4 54.76 51.66 54.46 51.85 54.57 53.60 53.56 56.86
5 56.89 53.21 55.41 56.45 56.45 54.00 54.40 58.18
Table 5: Results of IoU performance (%) on SemanticKITTI [5] with SPVCNN [26].
% Labeled Data RAND MAR CONF ENT CSET SEGENT ReDAL (Ours)
init. 37.74 38.20 37.32 37.33 36.86 37.75 37.48
2 42.74 42.73 42.01 42.16 41.25 42.62 48.88
3 48.82 45.07 47.37 45.77 45.15 49.51 55.30
4 52.51 47.84 49.54 49.46 49.93 51.87 58.35
5 54.67 51.27 53.49 52.34 51.89 53.12 59.76
Table 6: Results of IoU performance (%) on SemanticKITTI [5] with MinkowskiNet [6].

Due to space limitations in the main paper, we show the raw experimental numbers behind its line charts here. Tables 3, 4, 5, and 6 contain the original data of Figure 5 in the main paper. Tables 7 and 8 present the original data of Tables 1 and 2 in the main paper.

Method mIoU car bicycle motorcycle truck other-vehicle person bicyclist motorcyclist road parking sidewalk other-ground building fence vegetation trunk terrain pole traffic-sign
Full 61.4 95.9 20.4 63.9 70.3 45.5 65.0 78.5 0.4 93.5 50.6 82.0 0.2 91.2 63.8 87.2 68.5 74.3 64.4 50.1
RAND 54.7 94.7 9.5 45.0 66.8 38.6 52.0 47.8 0.0 90.2 38.5 76.1 1.8 88.3 55.5 87.9 64.0 76.5 60.2 45.6
ReDAL 59.8 95.4 29.6 58.6 63.4 49.8 63.4 84.1 0.5 91.5 39.3 78.4 1.2 89.3 54.4 87.4 62.0 74.1 63.5 49.7
Table 7: Results of IoU performance (%) with only 5% labeled points. The table shows that our ReDAL achieves better results on most classes than the random selection baseline. For some classes of small objects and objects with complex boundaries, such as bicycle and bicyclist, our ReDAL greatly surpasses the random selection baseline and even outperforms the fully supervised result.

Method car bicycle motorcycle truck other-vehicle person bicyclist motorcyclist road parking sidewalk other-ground building fence vegetation trunk terrain pole traffic-sign
Full 43.68 0.17 0.41 2.02 2.40 0.36 0.13 0.04 205.22 15.19 148.59 4.03 137.00 74.69 275.57 6.23 80.67 2.95 0.63
RAND 43.89 0.14 0.34 3.51 2.12 0.42 0.11 0.05 206.86 14.07 147.32 4.02 137.63 74.47 274.47 6.21 80.54 3.02 0.73
ReDAL 33.71 0.25 0.51 8.01 11.36 1.27 0.21 0.07 168.16 20.15 145.77 16.92 132.22 78.68 252.65 9.25 114.45 4.48 1.87
Table 8: Labeled Class Distribution Ratio (‰). With limited annotation budgets, our active method ReDAL queries more labels on small objects like person and bicycle but fewer on large uniform areas like road and vegetation. The selection strategy mitigates the label imbalance problem and improves performance on more complicated objects without hurting large areas much, as shown in Table 7.