Few-Shot Point Cloud Region Annotation with Human in the Loop

by Siddhant Jain, et al.

We propose a point cloud annotation framework that employs human-in-the-loop learning to enable the creation of large point cloud datasets with per-point annotations. Sparse labels from a human annotator are iteratively propagated to generate a full segmentation of the point cloud by fine-tuning a pre-trained model of an allied task via a few-shot learning paradigm. We show that the proposed framework significantly reduces the amount of human interaction needed to annotate point clouds without sacrificing the quality of the annotations. Our experiments also suggest that the framework is suitable for annotating large datasets: the human interaction required decreases as the number of full annotations completed by the system increases. Finally, we demonstrate the flexibility of the framework to support multiple different annotations of the same point cloud, enabling the creation of datasets with different granularities of annotation.




1 Introduction

Two-dimensional images have been the most popular digital representation of the world; however, point cloud data is increasingly gaining center stage with applications in autonomous driving, robotics and augmented reality. While synthetic point cloud datasets have been around for some time (Chang et al., 2015a), the prevalence of depth cameras such as (Keselman et al., 2017) and (Zhang, 2012) has led to the creation of large 3D datasets (Choi et al., 2016) by applying techniques from (Newcombe et al., 2011) to depth scans. Finally, we have also seen a number of point cloud datasets created using LIDAR scans of outdoor environments, such as (Hackel et al., 2017) and (Behley et al., 2019).

The intensity and geometric information in point clouds provide a more detailed digital description of the world than images, but their value in algorithmic analysis is fully realised only when the points have an associated semantic label. However, annotating 3D point clouds is a time-consuming and labour-intensive process owing to the size of the datasets and the limitations of 2D interfaces in manipulating 3D data.

The problem of providing a label for each point in a point cloud has been tackled via a host of fully automatic approaches in the domain of point cloud segmentation (Tchapmi et al., 2017) (Landrieu & Simonovsky, 2017). While these approaches are successful in delineating large structures such as buildings, roads and vehicles, they perform poorly on finer details in the 3D models. Besides, most of these approaches use supervised learning methods, which in turn rely on labelled datasets, making it a chicken-and-egg problem.

Thus, most existing datasets (Behley et al., 2019) (Hackel et al., 2017) (Roynard et al., 2018) have been annotated via predominantly manual systems to ensure accuracy and to avoid algorithmic biases in the produced datasets. The large investment of human effort required to generate the annotations severely limits both the significance and the prevalence of the point cloud datasets available to the community.

Annotating large-scale datasets is a natural use case for fusing human and algorithmic intelligence. Annotations inherently rely on a human definition and are also representative of semantic patterns that can be identified by an algorithm. Thus, we observe an active field of research which seeks to fuse human and algorithmic actors in one overarching framework to aid in annotating datasets. Most notable in the context of point cloud annotation is (Yi et al., 2016), which proposes an active learning framework for region annotations in datasets with repeating shapes. However, this method is limited in that it allows annotation only in certain 2D views of the point clouds. Our proposed framework allows for annotation in full 3D, thus allowing for finer annotation of the point cloud and the ability to work with less structured real-world point clouds as opposed to relatively noise-free synthetic point clouds.

Figure 1: This figure illustrates our pipeline. The first step starts with a partial sparse annotation by a human, followed by a region growing step using 3D geometric cues. We then iterate between few-shot learning using newly available annotations and sparse correction of predictions via human annotator to obtain final segmentation outputs.

In this work, we propose a human-in-the-loop learning approach that fuses manual annotation with algorithmic propagation and capitalises on existing 3D datasets to improve semantic understanding. Our method starts with a partial sparse annotation by a human, followed by a region growing step using 3D geometric cues. We then iterate over the following steps: a) model fine-tuning using newly available annotations, b) model prediction on the annotated point cloud, and c) sparse correction of predictions by the human annotator. Figure 1 gives a snapshot of our annotation approach.

In the next couple of sections we go over our methodology in more detail, followed by a discussion of our results and future work.

2 Methodology

This system is primarily focused on providing an annotation framework to create datasets of point clouds with ground truth semantic labels for each point. For a given point cloud, our method starts with sparse manual annotation and then iterates between two main steps: few-shot learning and manual correction. The manual annotations are provided by marking a few representative points for each part to be labelled in the point cloud. These labels are propagated across the point cloud using geometric cues, and the propagated labels are used to train the network. The final step involves correcting network mispredictions, and these corrections further guide the training process. For the first point clouds to be annotated, these steps are iterated multiple times, but as more point clouds are annotated using this framework, the method converges to relying only on the initial set of manual annotations (or no annotations at all) to make accurate annotation predictions.

2.1 Manual Annotations and Region Growing

The decomposition of a point cloud into semantically meaningful parts/regions is an open-ended problem, as the concept of an annotation is context dependent (Benhabiles et al., 2009). Owing to this ambiguity, the first step of our annotation pipeline is to allow the user to determine the number of possible classes that exist in the segmentation of point clouds in the dataset. The framework of annotation, learning and correction also provides the flexibility to have a different number of segments for the same point cloud, allowing for the creation of datasets with varying granularities of segmentation as in (Mo et al., 2018). The user initially provides labels for a point or a small group of points for each of the classes in the point cloud. Thereafter, the human-provided annotations are automatically propagated to a few unlabelled points by exploiting the geometry of the point cloud. We believe that relying on geometric attributes like surface normals, smoothness, curvature and color (if available) simplifies the goal of segmentation to decomposing the point cloud into locally smooth regions enclosed by sharp boundaries. These segmentations also often match human perception and can be used as an initial training example for the learning pipeline. For this reason, we use cues like surface normals to group spatially close points as belonging to the same region. We also experimented with color-based region growing, and with K-Nearest Neighbour (KNN) and Fixed Distance Neighbor (FDN) (Mishra, 1997) based region growing methods, which end up being faster than surface-normal-based region growing without compromising the accuracy of the overall system. Region growing approaches reduce the annotator's overhead of selecting multiple points by providing a geometry-aware selection mechanism.
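The propagation step described above can be illustrated with a minimal sketch of surface-normal-based region growing over a K-nearest-neighbour graph. This is an illustration under stated assumptions, not the implementation used in the experiments: the function name `grow_regions`, the parameters `k` and `max_angle_deg`, the brute-force neighbour search, and the convention of marking unlabelled points with -1 are all hypothetical choices made here.

```python
import numpy as np
from collections import deque


def grow_regions(points, normals, seeds, k=8, max_angle_deg=10.0):
    """Propagate seed labels to nearby points with similar surface normals.

    points:  (N, 3) array of coordinates.
    normals: (N, 3) array of unit surface normals (assumed precomputed).
    seeds:   dict mapping point index -> class label (the human clicks).
    Returns an (N,) int array of labels; -1 marks points left unlabelled.
    """
    n = len(points)
    labels = np.full(n, -1, dtype=int)

    # Brute-force k nearest neighbours. Fine for a sketch, but O(N^2) memory;
    # a spatial index would be needed for real point clouds.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]  # column 0 is the point itself

    # A small angle keeps the growth conservative, so regions stop at creases.
    cos_thresh = np.cos(np.deg2rad(max_angle_deg))

    queue = deque()
    for idx, lab in seeds.items():
        labels[idx] = lab
        queue.append(idx)

    # Breadth-first growth: a neighbour joins the region when its normal is
    # close to the normal of the point it is reached from.
    while queue:
        i = queue.popleft()
        for j in knn[i]:
            if labels[j] == -1 and normals[i] @ normals[j] >= cos_thresh:
                labels[j] = labels[i]
                queue.append(j)
    return labels
```

Points that are never reached keep the label -1 and are simply left out of the initial training set, which matches the idea of preferring a small but clean seed over a large but noisy one.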

2.2 Few-shot Learning

The goal of few-shot learning optimization in this context is to rely on minimal human supervision to improve segmentation accuracy. For this reason, we obtain the initial set of ground truth labels for training from manual annotations and use region growing methods for further supervision. We use very conservative thresholds for the region growing methods to avoid noisy ground truth labels. We also use a pre-trained network to bootstrap the training process and reduce the amount of human effort in the correction phase.

The pre-trained network to be used in this system can be any segmentation network, pre-trained on an existing dataset in a similar domain. For our experiments, we used PointNet (Qi et al., 2016) pre-trained on ShapeNet (Chang et al., 2015b) to bootstrap the training.

We fine-tune the base network iteratively using limited supervision in the form of annotations and corrections provided by the human in the loop. We also dynamically adapt the base network depending on the number of segmentation classes in the point cloud. The initial seed acquired from manual annotation and region growing gives a partially labelled point cloud that is used to fine-tune the base network. The model leverages the prior semantic understanding in the pre-trained network alongside the supervision of partially labelled points in the entire point cloud to provide meaningful segmentations in the first stage. We rely on the human annotators to compensate for network mispredictions by assigning new labels to points with incorrect segmentations. Subsequently, we fine-tune further with all the labels (initial seed + corrections) that the human annotator has provided so far. This process continues until all points are labelled correctly, as verified by the human annotator. At this stage, we retrain the network with all the points in the point cloud, which allows us to propagate these labels to newer point clouds of the same shape in the dataset. Figure 2 illustrates a sample of the results from this loop of user feedback and fine-tuning.
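The loop of fine-tuning, prediction and correction described above can be sketched as a small driver function. This is a schematic under stated assumptions rather than the actual system: `annotate_point_cloud`, the three callables `fine_tune`, `predict` and `get_corrections`, the `max_rounds` cap, and the use of -1 for unlabelled points are hypothetical names and conventions introduced only for illustration. The network and the annotation tool are passed in as plain callables.

```python
import numpy as np


def annotate_point_cloud(points, seed_labels, fine_tune, predict,
                         get_corrections, max_rounds=10):
    """Alternate between few-shot fine-tuning and sparse human correction.

    seed_labels: (N,) int array from manual annotation plus region growing,
        with -1 where no label is known yet.
    fine_tune(points, labels): updates the model using the labelled points.
    predict(points): returns an (N,) array of predicted labels.
    get_corrections(pred): dict {point index: correct label} supplied by the
        annotator; an empty dict means the prediction is accepted.
    """
    labels = np.asarray(seed_labels).copy()
    pred = labels.copy()
    for _ in range(max_rounds):
        fine_tune(points, labels)            # (a) fine-tune on labels so far
        pred = predict(points)               # (b) predict every point
        corrections = get_corrections(pred)  # (c) sparse human corrections
        if not corrections:
            # Annotator accepts the result: retrain once on the full
            # labelling so the model carries over to the next point cloud.
            fine_tune(points, pred)
            return pred
        for idx, lab in corrections.items():
            labels[idx] = lab                # seed + corrections accumulate
    return pred
```

Keeping the model and the annotator behind callables means the same loop can be exercised with simple stand-ins, for example a nearest-labelled-point classifier and a scripted annotator, before a real network is plugged in.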

Figure 2: Effects of few-shot learning in a 3-class segmentation of a chair. From left to right: i) manual annotation with region growing; ii) predictions of the network after fine-tuning (notice the spillage of labels at the boundary, which is resolved after the correction and final learning steps); iii) partially corrected point cloud from the user; iv) final prediction after fine-tuning with corrections.

2.2.1 Smoothness Loss

We formulate segmentation as a per-point classification problem, similar to the setup of PointNet (Qi et al., 2016), including global and local feature aggregation. We also use a transformation network to ensure that the network predictions are agnostic to rigid transformations of the point cloud. We further leverage the smoothness of the shape to favor regions that are compact and continuous. Overall, the network loss combines a segmentation cross-entropy loss with a smoothness loss that encourages adjacent points to have similar labels.

The smoothness term is computed as the pairwise Kullback-Leibler (KL) divergence of predictions, exponentially weighted by the Euclidean distance between any two points in the point cloud. The scale parameter of this exponential weighting is set to the variance of the pairwise distances between all points, so that the loss term captures point cloud density. The smoothness term is thus expected to capture and minimize the relative entropy between neighboring points in the point cloud. It ends up dominating the total loss if nearby points have divergent logits. Points that are far from each other contribute very little to the overall loss, regardless of their logits, owing to the high pairwise distances between them.
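The loss described above can be written, in one plausible form, as follows. This is a sketch reconstructed from the prose description only and may differ from the exact formulation used in the experiments. The symbols are notational assumptions: $\lambda$ is the weight of the smoothness term, $\sigma$ is the scale parameter set to the variance of pairwise distances, $p_i$ is the predicted class distribution at point $x_i$, and $y_i$ is the label of point $x_i$, available only on the labelled subset $\mathcal{S}$. The squared distance in the exponent (chosen so that the ratio with a variance is dimensionless) and the absence of a normalising constant are also assumptions.

```latex
\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda \, \mathcal{L}_{\text{smooth}},
\qquad
\mathcal{L}_{\text{seg}} = -\frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \log p_i(y_i)

\mathcal{L}_{\text{smooth}} = \sum_{i=1}^{N} \sum_{j \neq i}
  \exp\!\left( -\frac{\lVert x_i - x_j \rVert^{2}}{\sigma} \right)
  D_{\mathrm{KL}}\!\left( p_i \,\Vert\, p_j \right),
\qquad
D_{\mathrm{KL}}\!\left( p_i \,\Vert\, p_j \right)
  = \sum_{c=1}^{C} p_i(c) \log \frac{p_i(c)}{p_j(c)}
```

Under this form, a pair of nearby points has a weight close to 1, so divergent predictions are penalised strongly, while a distant pair has a weight close to 0 and contributes almost nothing, which matches the behaviour described in the text.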

Figure 3 shows a qualitative example for the effect of the smoothness loss.

Figure 3: Illustration to show effect of the smoothness loss. From left to right i) Manual annotation with region growing ii) Predictions of the network without smoothness loss iii) Network predictions with smoothness loss

Our experiments show that the first-stage segmentation output requires less cognitive effort to correct when the smoothness term is added to the loss computation. The weight of the smoothness loss term is subsequently dropped after further supervision is obtained from the user.
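As a numerical illustration of the smoothness term described in this section, the following NumPy sketch computes a cross-entropy loss on the labelled points only and a distance-weighted pairwise KL term, then combines them. It is a sketch under assumptions, not the training code: the function names, the `smooth_weight` parameter, the squared distance in the exponent, the plain sum over ordered pairs, and the -1 convention for unlabelled points are all choices made here for illustration. It assumes at least two points.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def masked_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross entropy over labelled points; labels of -1 are ignored."""
    rows = np.flatnonzero(labels >= 0)
    if rows.size == 0:
        return 0.0
    return float(-np.log(probs[rows, labels[rows]] + eps).mean())


def smoothness_loss(points, probs, eps=1e-12):
    """Sum over ordered pairs of exp(-d_ij^2 / sigma) * KL(p_i || p_j)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))                  # (N, N) distances
    # sigma: variance of the pairwise distances, as described in the text.
    sigma = dist[np.triu_indices(len(points), k=1)].var()
    weight = np.exp(-dist ** 2 / (sigma + eps))
    np.fill_diagonal(weight, 0.0)                        # skip i == j
    logp = np.log(probs + eps)
    # kl[i, j] = sum_c p_i[c] * (log p_i[c] - log p_j[c])
    kl = (probs * logp).sum(axis=1)[:, None] - probs @ logp.T
    kl = np.maximum(kl, 0.0)                             # guard round-off
    return float((weight * kl).sum())


def total_loss(points, logits, labels, smooth_weight=1.0):
    """Segmentation loss plus weighted smoothness loss."""
    probs = softmax(logits)
    return (masked_cross_entropy(probs, labels)
            + smooth_weight * smoothness_loss(points, probs))
```

Setting `smooth_weight=0.0` corresponds to dropping the smoothness term once the user has supplied further corrections, leaving only the cross-entropy on the labelled points.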

3 Results

In this section, we discuss the experimental setup used to validate the effectiveness of our framework by comparing its utility for creating new datasets against completely manual or semi-automatic methods. Additionally, we investigate the improvement in annotation efficiency as the total number of annotated point clouds increases.

Dataset. To test the robustness of our framework and the ease of adapting it, we aim to use it to create a massive and diverse dataset of synthetic and reconstructed point clouds. Towards this goal, we have created part segmentations of reconstructed point clouds taken from A Large Dataset of Object Scans (Choi et al., 2016). Qualitative results for these segmentations are shown in Figure 4. The framework showed a marked improvement in human annotation efficiency, measured as the number of clicks required for manual annotation and correction, which is discussed in the subsequent parts of this section.

Figure 4: Qualitative results for part segmentation on reconstructed point clouds from Large Dataset of Object Scans (Choi et al., 2016). The results are shown for segmentation of noisy shapes in potted plant and chair class into two and three classes respectively using our framework.

Granularity. The framework also provides the flexibility to annotate the same shape with a different number of classes. The user selects sparse points for each of the classes in the first stage, and this information is dynamically incorporated in the training process by re-initializing the last layer to accommodate the different number of classes. Qualitative segmentation outputs are illustrated in Figure 5.
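Re-initializing the last layer for a new number of classes can be sketched as follows. This is a framework-agnostic illustration, not PointNet's actual parameter layout: the dictionary keys `W_out` and `b_out`, the function name `reinit_last_layer`, and the small-Gaussian initialisation with `scale=0.01` are hypothetical.

```python
import numpy as np


def reinit_last_layer(params, num_classes, rng=None, scale=0.01):
    """Return new parameters whose output layer produces num_classes logits.

    params: dict holding 'W_out' with shape (feature_dim, old_classes) and
        'b_out' with shape (old_classes,). All other entries are reused
        unchanged, so the pre-trained features are preserved.
    """
    rng = np.random.default_rng() if rng is None else rng
    feature_dim = params["W_out"].shape[0]
    new_params = dict(params)  # shallow copy: earlier layers are shared
    new_params["W_out"] = rng.normal(0.0, scale, size=(feature_dim, num_classes))
    new_params["b_out"] = np.zeros(num_classes)
    return new_params
```

Only the classification head changes shape, so the same pre-trained feature extractor can serve a two-part and a three-part annotation of the same shape.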

Figure 5: Qualitative results for part segmentation on the same shape with different granularities.

3.1 Annotation Efficiency Improvements

Existing semi-supervised methods (Yi et al., 2016) use the amount of supervision and accuracy as evaluation metrics to measure performance. We follow suit and compare the amount of supervision needed to completely annotate a point cloud in our framework with that needed by completely manual methods. In Table 1 we compare the number of annotator clicks required in our framework with a naive nearest-neighbour, painting-based manual approach.

Table 1: Average number of clicks taken to annotate point clouds with varying granularities in terms of number of parts for the same shape. We notice a significant reduction in number of clicks in comparison to manual methods and even our method without the smoothness constraint.

With subsequent complete annotations of point clouds coming from the same dataset, we expect a reduction in the human supervision needed in order to have a scalable system. As we incrementally train the network on a progressively complete annotation of the point clouds, the model adapts to the properties of the new domain represented by the dataset. Thus, we are able to predict a more accurate segmentation of the point cloud in the initial iterations thereby cutting down on the total number of user correction steps needed. This is validated via our experiments as illustrated in Figure 6.

Figure 6: Number of clicks taken to annotate subsequent point clouds using our framework. We see a reduction in number of clicks needed as more point clouds of the new dataset are annotated.

While previous works measure the amount of user supervision based on invested time, we focused on quantifying supervision via the number of clicks. Through our experiments we observed that the time taken to annotate a point cloud decreases with the number of point clouds annotated, even in completely manual methods. This is because a large part of the time taken in annotating goes into manipulating the point clouds in a 2D tool. As the annotators label more point clouds, they become more accustomed to the tool and the relevant manipulation interactions, reducing the overall time they need to annotate subsequent point clouds. On the other hand, the number of clicks needed depends more on the complexity of the point cloud and the number of classes to be annotated than on the number of point clouds previously annotated in the system, making it a suitable metric for evaluation.

4 Conclusion

We provide a scalable interactive learning framework that can be used to annotate large point cloud datasets. By fusing together three different cues (human annotations, learnt semantic similarity and geometric consistencies) we are able to obtain accurate annotations with fewer human interactions. We note that while the number of clicks is a useful proxy for the amount of human interaction needed, it is also important to study the amount of time needed for each click, as it adds to the overall human time investment needed to annotate a dataset. Significant leaps in reducing the cognitive load on a human annotator can be made by replacing 2D user interfaces with spatial user interfaces facilitated by virtual reality systems, as they make point cloud manipulation and visualization more natural for the annotator.