Primitive Fitting Using Deep Boundary Aware Geometric Segmentation

10/03/2018 ∙ by Duanshun Li, et al. ∙ University of Alberta NYU college 12

Identifying and fitting geometric primitives (e.g., planes, spheres, cylinders, cones) in a noisy point cloud is a challenging yet beneficial task for fields such as robotics and reverse engineering. As a multi-model multi-instance fitting problem, it has been tackled with different approaches including RANSAC, which, however, often fits inferior models in practice given noisy inputs of cluttered scenes. Inspired by the corresponding human recognition process, and benefiting from the recent advancements in image semantic segmentation using deep neural networks, we propose BAGSFit as a new framework addressing this problem. First, through a fully convolutional neural network, the input point cloud is segmented point-wise into multiple classes divided by jointly detected instance boundaries, without any geometric fitting. The resulting segments serve as primitive hypotheses, each with a probability estimate of its associated primitive class. Finally, all hypotheses are sent through a geometric verification that corrects any misclassification by fitting primitives to each of them. We trained the network on simulated range images and tested it on both simulated and real-world point clouds. Quantitative and qualitative experiments demonstrated the superiority of BAGSFit.




I Introduction

“Treat nature by means of the cylinder, the sphere, and the cone.” — Paul Cézanne, 1904

Not only in art, the idea of decomposing a scene or a complex object into a set of simple geometric primitives for visual object recognition dates back to as early as the 1980s, when Biederman proposed the object Recognition-By-Components theory [1], in which primitives were termed “geons”. Although some real scenes can be more complicated than simple combinations of “geons”, there are many useful ones that can be efficiently modeled for the purpose of robotics: planes in man-made structures, utility pipelines as cylinders, household objects such as paper cups, and more interestingly, a robot itself, often as an assembly of simple primitives.

Thus, for better extro- and intro-spection to improve the intelligence of all kinds of robots, from autonomous cars to service robots, it is beneficial to robustly detect those primitives and accurately estimate the associated parameters from noisy 3D sensor inputs. Applications include robotic manipulation that requires poses and shapes of objects [2], SLAM that takes advantage of primitives (mostly planes) for better mapping accuracy [3, 4, 5], reverse engineering that models complex mechanical parts as primitives [6], and similarly as-built Building Information Modeling [7, 8].

Fig. 1: Primitive fitting on a simulated test range image (top left) with BAGSFit (middle right) vs. RANSAC (top right) [9]. Estimated normals (middle left) and ground truth labels (bottom left) are used to train a fully convolutional segmentation network in BAGSFit. During testing, a boundary-aware and thus instance-aware segmentation (bottom right) is predicted, and sent through a geometric verification to fit final primitives (randomly colored). Compared with BAGSFit, the RANSAC-based method produces more misses and false detections of primitives (shown as transparent or wire-frame), and thus a less appealing visual result. Label legend: Boundary, Plane, Sphere, Cylinder, Cone, Other.

This primitive fitting is a classic chicken-and-egg problem: with given primitive parameters, point-to-primitive (P2P) membership can be determined by the nearest P2P distance; and vice versa, with given P2P membership, primitive parameters can be found by robust estimation. The challenge comes when multiple factors are present together: a noisy point cloud (thus noisy normal estimation), a cluttered scene due to multiple instances of the same or multiple primitive models, and also background points not explained by the primitive library. See Figure 1 for an example. The seminal RANSAC-based method [9] often tends to fit inferior primitives that do not represent the real scene well.
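The first half of this chicken-and-egg relation, membership from known parameters, can be illustrated with a minimal sketch. It assumes only two primitive types (a plane and a sphere) with known parameters; the function names, the example parameters, and the outlier threshold `max_dist` are illustrative assumptions rather than anything specified in this paper.

```python
import math

def plane_distance(p, normal, d):
    """Distance from point p to the plane n.x + d = 0 (n must have unit length)."""
    return abs(sum(pi * ni for pi, ni in zip(p, normal)) + d)

def sphere_distance(p, center, radius):
    """Distance from point p to the surface of a sphere."""
    return abs(math.dist(p, center) - radius)

def assign_membership(points, primitives, max_dist):
    """Label each point with the index of its nearest primitive, or -1 if too far."""
    labels = []
    for p in points:
        dists = [dist_fn(p) for dist_fn in primitives]
        best = min(range(len(dists)), key=dists.__getitem__)
        labels.append(best if dists[best] <= max_dist else -1)
    return labels

# Hypothetical scene: a ground plane z = 0 and a sphere of radius 0.5 above it.
primitives = [
    lambda p: plane_distance(p, (0.0, 0.0, 1.0), 0.0),
    lambda p: sphere_distance(p, (0.0, 0.0, 1.0), 0.5),
]
points = [(1.0, 2.0, 0.01), (0.0, 0.0, 1.5), (0.5, 0.0, 1.0), (3.0, 3.0, 3.0)]
labels = assign_membership(points, primitives, max_dist=0.05)
print(labels)
```

The last point is far from both primitives and is labeled -1, which corresponds to a background point not explained by the primitive library.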

Different from existing work on this multi-model multi-instance fitting problem, we are inspired by human visual perception of 3D primitives. As found by many cognitive science researchers, human “observers’ judgments about 3D shape are often systematically distorted” [10]. For example, when looking at a used fitness ball, many people would think of it as a sphere, although it could be largely distorted if carefully measured. This suggests that the human brain might not be performing exact geometric fitting during primitive recognition, but rather relying on “qualitative aspects of 3D structure” [10] from visual perception.

Given the recent advancements in image semantic segmentation using convolutional neural networks (CNNs) [11, 12], it is natural to ask: can a CNN be applied to this problem of a geometric nature, and find the P2P membership as a segmentation of the range image without any geometric fitting at all? Our answer is yes, which leads to the BAGSFit framework that reflects this thought process.

I-A Contributions

This paper contains the following key contributions:

  • We present a methodology to easily obtain point-wise ground truth labels from a simulated dataset for supervised geometric segmentation, demonstrate its ability to generalize to real-world datasets, and release the first simulated dataset for development and benchmarking (footnote 1: the dataset is publicly available).

  • We present a novel framework for multi-model 3D primitive fitting, which performs both qualitatively and quantitatively better than RANSAC-based methods on noisy range images of cluttered scenes.

  • We introduce this geometric segmentation task for CNN with several design analyses and comparisons.

I-B Related Work

Depending on the basic assumption of the number of primitive classes, previous work can be roughly grouped into multi-instance vs. multi-model, described as follows.

Multi-Instance Fitting: As a subset of the multi-model fitting problem, it typically assumes that a scene is mainly composed of a single class of primitive model, such as plane detection by normal-based region growing [13] or agglomerative clustering [14], and sphere detection by Hough transform [15] or mean-shift clustering [16]. Benefiting from the simple assumption, such methods can work in real time and obtain accurate results. Yet when the assumption is violated, they often generate many false detections (e.g., a curved surface fitted as multiple small planes), which require careful threshold tuning to be filtered out.

Multi-Model Fitting: By assuming the existence of potentially multiple classes of primitives in a scene, it is more realistic than the previous group, and thus more challenging when a cluttered scene is observed with noisy 3D sensors. Previous work with this assumption can be roughly grouped further into the following three categories:

Segmentation: These methods are rooted in the idea of segmenting a point cloud into individual clusters, while the model classification and fitting are performed either during the segmentation or afterwards. For example, an early method simultaneously grows multiple seed regions with dynamic primitive model selection by iterative regression [17], but was not shown to necessarily work for noisy cluttered scenes. A more recent work in reverse engineering [18] assumes the input 3D mesh has been previously segmented into parts, and then classifies each part based on the Gaussian sphere, i.e., the normal space. The data quality in this work is much better than that of the more prevalent Kinect-like range images, thus it might not be suitable for noisy and cluttered data in robotics. Another recent real-time algorithm [19] fits conic curves in each scan-line and then merges neighboring curves into primitives. Possibly due to missed detections of smaller curves, this first-2D-then-3D approach has not been shown to work robustly in cluttered scenes. While sharing the same basic segmentation idea as above, our geometric segmentation CNN performs no fitting, but only provides plausibility maps for further geometric verification if needed, or even just as shape priors for more complex scene reconstruction [20], e.g., tree trunks as cylindrical shapes.

RANSAC: Since the seminal work by Schnabel et al. [9], the idea of selecting sampled primitive hypotheses to maximize some scoring function has become a default solution to this problem, and it serves as our baseline method. It stimulated several variations, including GlobFit, which explores spatial constraints between primitives for regularization [6]. Note that in primitive fitting practice, the 3D sensor noise is often more structured (e.g., depth-dependent noise for range images) than the uniform or Gaussian 3D noise experimented with in many of these papers. What really makes the problem difficult is that noisy points belonging to other partially occluded primitive instances become outliers of the primitive being fitted, causing false detections of “ghost” primitives that do not exist in the real scene but still have very small fitting errors and large consensus scores, e.g., the ghost cones fitted to cylinder and background points in the top right of Figure 1. More recently, prior probabilities or quality measures of the data [21, 22] were used to improve the probability of sampling an all-inlier subset. Others explored the use of spatial consistency between the data [23, 24] to speed up the hypothesis generation process. RANSAC has also been generalized as a set coverage problem [25], or extended from a duality perspective as preference analysis or residual sorting and its variants [26, 27, 28, 29]. While these methods are theoretically interesting and perform well on classic multi-instance 2D fitting tasks such as multi-homography detection, we are not aware of any of them being explicitly shown to work well for multi-model geometric primitive fitting from cluttered and noisy 3D range images.

Energy Minimization: Unlike the sequential and greedy nature of RANSAC-based methods, it is appealing in theory to define a global energy function in terms of P2P membership that, once minimized, results in the desired solution [30, 31, 32, 33]. However, most of these methods are only shown on a relatively small number of points from simple scenes without much clutter or occlusion, and it is unclear how they will scale to larger datasets due to the intrinsic difficulty and slowness of minimizing the energy function.

(a) CNN-based Segmentation

(b) Geometric Verification

(c) Fitted Primitives
Fig. 2: BAGSFit overview. In 2(a), a proper form of a range image, e.g., its normal map, is input to a fully convolutional neural network for segmentation. We use the same visualization style for the CNN as in [12], where each block means layers sharing the same spatial resolution, decreasing block height means decimating the spatial resolution by half, and red dashed lines mean loss computation. The black dashed line is only applied for joint boundary detection with multi-binomial loss, where low-level edge features are expected to be helpful if skip-concatenated for the final boundary classification. The resulting segmentation probability maps (top row of 2(b), darker for higher probability) for each primitive class are sent through a geometric verification to correct any misclassification by fitting the corresponding class of primitives (bottom row of 2(b)). Finally, fitted primitives are shown in 2(c). Without loss of generality, this paper only focuses on four common primitives: plane, sphere, cylinder, and cone.

II Framework Overview

Figure 2 gives a visual overview of the multi-model primitive fitting process in our BAGSFit framework. As introduced above, the front-end of this framework (Figure 2(a)) mimics the human visual perception process in that it does not explicitly use any geometric fitting error or loss in the CNN. Instead, it takes advantage of a set of stable features learned by the CNN that can robustly discriminate points belonging to different primitive classes. A pixel of the output probability map (top row of Figure 2(b)) can be interpreted as how much that point and its neighborhood look like a specific primitive class, where the neighborhood size is the CNN receptive field size.

Such a segmentation map could already be useful for more complex tasks [20], yet for the sake of a robust primitive fitting pipeline, one cannot fully trust this segmentation map as it inevitably contains misclassification, just like all other image semantic segmentations. Fortunately, by separating pixels belonging to individual primitive classes, our original multi-model problem is converted to an easier multi-instance problem. Following this segmentation, a geometric verification step based on efficient RANSAC [9] incorporates our strong prior knowledge, i.e., the mathematical definitions of those primitive classes, to find the parametric models of the objects for each type of primitive. Note that RANSAC variants using prior inlier probability to improve sampling efficiency are not adopted in this research, because 1) they are orthogonal to the proposed pipeline; and 2) the robustness of primitive fitting is highly dependent on the spatial distribution of samples. Unlike spatial consistency based methods [23, 24], which mainly deal with homography detection, in our 3D primitive fitting task samples with points very close to each other usually lead to bad primitive fitting results [9]. Thus the potential of using the CNN-predicted class probabilities to guide the sampling process, while interesting, is deferred to future investigation.

The advantage for this geometric segmentation task is that exact spatial constraints can be applied to detect correct primitives even with noisy segmentation results. One could use the inliers after geometric verification to correct the CNN segmentation results, similar to the CRF post-processing step in image semantic segmentation that usually improves segmentation performance.

III Ground Truth from Simulation

Before going into the details of our segmentation CNN, we first need to address the challenge of preparing training data, because, like most state-of-the-art image semantic segmentation methods, our CNN needs to be trained with supervision. To the best of our knowledge, we are the first to introduce such a geometric primitive segmentation task for CNNs, thus there are no existing publicly available datasets for this task. For image semantic segmentation, there have been many efforts to use simulation for ground truth generation. Yet it is hard to make CNNs trained on simulated data generalize to real-world images, due to the intrinsic difficulty of tuning a large number of variables affecting the similarity between simulated images and real-world ones.

However, since we are only dealing with geometric data, since 3D observation is less sensitive to environmental variations, and since the observation noise models of most 3D sensors are well studied, we hypothesize that simulated 3D scans closely resemble real-world ones, such that CNNs trained on simulated scans can generalize well to real-world data. If this is true, then for this geometric task we can obtain an unlimited amount of point-wise ground truth almost for free.

Although saved from tedious manual labeling, we still need a systematic way of generating both random scene layouts of primitives and scan poses, so that simulated scans are meaningful and cover the true data variation as much as possible. Because the popular Kinect-like scanners are mostly applied in indoor environments, we choose to focus on simulating indoor scenes. Note that this does not limit our BAGSFit framework to indoor situations only. Given a specific type of scene and scanner, one should be able to adjust the random scene generation protocol similarly. Moreover, we hypothesize that the CNN is less sensitive to the overall scene layout. What is more important is to show the CNN enough cases of different primitives occluding and intersecting with each other.

Thus, we choose to randomly generate a room-like scene with a 10-meter extent in each horizontal direction. An elevated horizontal plane representing a table top is generated at a random position near the center of the room. Other primitives are placed near the table top to increase the complexity. Furthermore, empirically, the orientation of a cylinder/cone axis or a plane normal is dominated by horizontal or vertical directions in the real world. Thus several primitive instances at such orientations are generated deliberately in addition to fully random ones. For planes, two additional disk-shaped planes are added to make the dataset more general. To make the training set more realistic, two NURBS surfaces (class name “Other” in Figure 1) are added, representing objects not explained by our primitive library in reality.

An existing scanner simulator, Blensor [34], was used to simulate VGA-sized Kinect-like scans, where class and instance IDs can be easily obtained during the virtual scanning process by ray-tracing. The default Kinect scanner was adopted, except that the noise sigma parameter was set to 0.005. Note that we do not carefully tune the parameters to match the simulated noise with a real Kinect noise model. In fact, our simulated scanner produces slightly noisier points than a real Kinect sensor. To generate random scan poses, the virtual scanners were first placed around the center of the “table”. Then camera viewing directions were sampled on a grid of longitudinal and latitudinal intervals. For each direction, two distances to the table’s center were uniformly sampled. Thus, for each scene we obtain a total of 192 scan poses, i.e., 96 viewing directions with two distances each. Finally, a uniform noise was added to each viewing direction both horizontally and vertically. Figure 3 shows a screenshot of such a scan. In total, 20 scenes were generated following this protocol: 18 scenes, i.e., 3456 scans, were used for training, and the other 2 scenes, i.e., 384 scans, were used for validation. The test set was generated through a similar protocol, containing 20 scenes (each with 36 scans). Note that invalid points were converted to the zero-depth point to avoid computation issues.
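The pose sampling protocol above can be sketched as follows. Only the totals (192 poses per scene, 3456 training scans, 384 validation scans) come from the text; the 12 x 8 grid shape, the angle values, the distance range, and the jitter magnitude are placeholder assumptions chosen purely so that the example runs, and the function name is hypothetical.

```python
import random

def sample_scan_poses(lon_steps, lat_steps, dist_range, jitter_deg,
                      per_direction=2, seed=0):
    """Sample scan poses: every grid direction gets `per_direction` random distances."""
    rng = random.Random(seed)
    poses = []
    for lon in lon_steps:
        for lat in lat_steps:
            for _ in range(per_direction):
                poses.append({
                    # Uniform jitter added horizontally and vertically.
                    "lon_deg": lon + rng.uniform(-jitter_deg, jitter_deg),
                    "lat_deg": lat + rng.uniform(-jitter_deg, jitter_deg),
                    "distance_m": rng.uniform(*dist_range),
                })
    return poses

# Placeholder grid (assumption): 12 longitudes x 8 latitudes = 96 directions.
lon_steps = [i * 30.0 for i in range(12)]
lat_steps = [10.0 + i * 10.0 for i in range(8)]
poses = sample_scan_poses(lon_steps, lat_steps, dist_range=(1.5, 4.0), jitter_deg=5.0)

scans_train = 18 * len(poses)  # 18 training scenes
scans_val = 2 * len(poses)     # 2 validation scenes
print(len(poses), scans_train, scans_val)
```

Any grid with 96 directions reproduces the stated totals, so the grid shape here should not be read as the one actually used.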

Fig. 3: A simulated Kinect scan of a random scene. Black dots represent the scanned points.

IV Boundary Aware Geometric Segmentation

Our segmentation network (Figure 2(a)) follows the same basic network as described in [12], which is based on the 101-layer ResNet [35] with minor modifications to improve segmentation performance. While semantic segmentation CNN architectures are being actively developed, there are several design choices to be considered to achieve the best performance on a given base network for our new task.

Position vs. Normal Input. The first design choice is about the input representation. Since we are dealing with 3D geometric data, what form of input should be supplied to the CNN? A naive choice is to directly use point positions as a 3-channel tensor input. After all, this is the raw data we get in reality, and if the CNN is powerful enough, it should be able to learn everything from this input form. However, it is unclear how to normalize it, or whether normalization is necessary at all.

A second choice is to use estimated per-point unit normals as the input. This is also reasonable, because we can almost perceive the correct segmentation by just looking at the normal maps, as shown in Figure 2(a). Moreover, this input is already normalized, which usually enables better CNN training. However, since normals are estimated from noisy neighboring points, one might have concerns about loss of information compared with the previous choice. A third choice is to combine the first two, resulting in a 6-channel input, through which one might hope the CNN benefits from the merits of both.

Multinomial vs. Multi-binomial Loss. The second design question is: what kind of loss function to use? While many semantic segmentation CNNs choose the multinomial cross-entropy loss through a softmax function, recent studies have found other loss functions, such as the self-balancing multi-binomial loss [12], to perform better for certain tasks, with weights accounting for imbalanced classes. In this study, we consider two types of loss functions: 1) the classic “softmax loss”, and 2) a multi-binomial loss with class-specific loss weights as hyper-parameters:


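$$L(\mathbf{W}) = -\sum_{k=1}^{K} \beta_k \sum_{i} \Big[ \hat{p}_k(i) \log p_k(i \mid \mathbf{I}; \mathbf{W}) + \big(1 - \hat{p}_k(i)\big) \log\big(1 - p_k(i \mid \mathbf{I}; \mathbf{W})\big) \Big] \qquad (1)$$

where $\mathbf{W}$ are the learnable parameters, $i$ a pixel index, $\hat{p}_k$ the ground truth binary image and $p_k$ the network predicted probability map of the $k$-th primitive class ($k = 1, \dots, K$), and $\mathbf{I}$ the input data. We set the class-specific weight $\beta_k$ to be proportional to 1 over the total number of $k$-th class points in the training set.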
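As an illustration of a multi-binomial loss with class-specific weights, the sketch below computes a weighted sum of per-class binary cross-entropies over flat lists of pixels. It is a plain Python rendering under stated assumptions, not the training code used in this work: the function names, the clipping constant `eps`, and the normalization of the weights to sum to one are choices made here for the example.

```python
import math

def multi_binomial_loss(pred, truth, weights, eps=1e-12):
    """Weighted sum of per-class binary cross-entropies.

    pred[k][i]  : predicted probability that pixel i belongs to class k
    truth[k][i] : 1 if pixel i belongs to class k in the ground truth, else 0
    weights[k]  : class-specific loss weight
    """
    total = 0.0
    for p_k, y_k, w_k in zip(pred, truth, weights):
        for p, y in zip(p_k, y_k):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= w_k * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

def inverse_frequency_weights(truth):
    """Weights proportional to 1 / (number of ground truth points of each class)."""
    inv = [1.0 / max(sum(y_k), 1) for y_k in truth]
    scale = sum(inv)
    return [v / scale for v in inv]  # normalization to sum 1 is an assumption

# Two classes over four pixels: class 0 is three times as frequent as class 1.
truth = [[1, 1, 1, 0], [0, 0, 0, 1]]
pred = [[0.9, 0.8, 0.7, 0.2], [0.1, 0.2, 0.3, 0.8]]
weights = inverse_frequency_weights(truth)
loss = multi_binomial_loss(pred, truth, weights)
perfect_loss = multi_binomial_loss(truth, truth, weights)
print(weights, loss, perfect_loss)
```

Because each class is scored by its own binary term, the per-pixel probabilities are not forced to sum to one, which is what distinguishes this loss from the softmax loss.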

Separate vs. Joint Boundary Detection. When multiple instances of the same primitive class occlude or intersect with each other, even an ideal primitive class segmentation cannot divide them into individual segments. This leaves an undesirable multi-instance fitting problem for the geometric verification step to solve, which defeats the original purpose of this geometric segmentation. Moreover, boundaries usually contain higher noise in terms of estimated normals, which could negatively affect primitive fitting methods that use normals (e.g., 2-point based cylinder fitting). One way to alleviate the issue is to cut such clusters into primitive instances by instance-aware boundaries. To realize this, we again have two choices: 1) training a separate network only for instance boundary detection, or 2) treating boundary as an additional class to be segmented jointly with the primitive classes. One can expect the former to have better boundary detection results, as the network focuses on learning boundary features only, although it is a less elegant solution with more parameters and a longer running time. Thus it is reasonable to trade a bit of performance for the latter. Note that with such a step, we could already move from category-aware to boundary-aware and thus instance-aware segmentation by region growing after removing all instance-aware boundaries.
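The move from a boundary-aware class map to an instance-aware one can be sketched with a simple region growing pass. The example below assumes a small 2D class map stored as nested lists, 4-connectivity, and a label id of 0 for the boundary class; these conventions and the function name are illustrative assumptions.

```python
from collections import deque

BOUNDARY = 0  # hypothetical label id of the boundary class

def grow_instances(class_map):
    """Split a per-pixel class map into instances by 4-connected region growing.

    Boundary pixels are never visited, so they separate touching instances of
    the same class. Returns (instance-id map, number of instances); boundary
    pixels keep the id -1.
    """
    rows, cols = len(class_map), len(class_map[0])
    inst = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if class_map[r][c] == BOUNDARY or inst[r][c] != -1:
                continue
            cls = class_map[r][c]
            inst[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and inst[ny][nx] == -1
                            and class_map[ny][nx] == cls):
                        inst[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return inst, next_id

# Two regions of class 1 separated by a boundary column, plus one region of class 2.
class_map = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 2, 2],
]
inst, num_instances = grow_instances(class_map)
print(num_instances, inst)
```

Without the boundary column, the two class-1 regions would merge into a single segment, which is exactly the multi-instance ambiguity described above.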

Handling of Background Class. When generating random scenes, we added NURBS surfaces to model background points not explained by the four primitive classes, for a more realistic and challenging dataset. Thus we need to handle them properly in the CNN. Should we ignore the background class when computing the loss, or add it as an additional class?

For all of the above design questions, we will rely on experiments to empirically select the best performing ones.

V Geometric Verification and Evaluation

V-A Verification by Fitting

Given the predicted probability maps, we need to generate and verify primitive hypotheses, and fit the primitive parameters of the correct ones to complete our mission.

One direct way of hypothesis generation is to simply binarize the BAGS output by thresholding to produce a set of connected components, and to fit only one primitive of the $k$-th class to each component coming from the $k$-th class probability map. However, when the CNN incorrectly classifies certain critical regions due to non-optimal thresholds, two instances can be connected, leading to suboptimal fittings or missed detections of some instances. Moreover, even a perfect BAGS output may bring another issue: an instance can be cut into several smaller pieces due to occlusions (e.g., the top left cylinder in Figure 2(a)). Fitting in smaller regions of noisy scans usually results in false instance rejection or lower estimation accuracy. Since the core contribution of this paper is to propose and study the feasibility of BAGSFit as a new strategy towards this problem, we leave it as future work to develop more systematic ways to better utilize the probability maps for primitive fitting.

In this work, we simply follow a classic “arg max” prediction over each point, and get $K$ groups of hypothesis points, one associated with each of the $K$ primitive classes. Then we solve $K$ multi-instance primitive fitting problems using the RANSAC-based method [9]. This is more formally described in Algorithm 1. Note that this does not completely defeat the purpose of BAGS. The original RANSAC-based method feeds the whole point cloud into the pipeline and detects primitives sequentially in a greedy manner. Because it tends to detect larger objects first, smaller primitives close to large ones are often missed, as their member points might be incorrectly counted as inliers of larger objects, especially if the inlier threshold is improperly set. BAGS can alleviate such effects, and in particular, removing boundary points from RANSAC sampling is expected to improve its performance.

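function PrimitiveFitting(points, probability maps p_1, ..., p_K)
     H_k ← ∅ for k = 1, ..., K                   (initialize hypotheses sets)
     for each pixel i do                         (assign a pixel to its best set)
          k* ← arg max_k p_k(i);  H_k* ← H_k* ∪ {i}
     for k = 1, ..., K do                        (detect primitives from each set)
          M_k ← RANSAC-based fitting of class-k primitives on H_k
     return M_1 ∪ ... ∪ M_K
Algorithm 1 Primitive Fitting from Hypotheses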
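The hypothesis grouping in Algorithm 1 can be sketched in a few lines of Python. The per-class fitter is passed in as a callback, because the RANSAC-based fitter [9] is an external component; `fit_class`, `dummy_fitter`, and `skip_classes` are hypothetical names introduced for this illustration only.

```python
def primitive_fitting(prob_maps, points, fit_class, skip_classes=()):
    """Group points by arg max class, then run a per-class multi-instance fitter.

    prob_maps[k][i]   : predicted probability of class k at point i
    fit_class(k, pts) : hypothetical callback standing in for the RANSAC-based
                        fitter; returns a list of fitted primitives of class k
    skip_classes      : class ids that are not fitted (e.g., a boundary class)
    """
    num_classes = len(prob_maps)
    hypotheses = [[] for _ in range(num_classes)]      # initialize hypotheses sets
    for i, point in enumerate(points):                 # assign a point to its best set
        best = max(range(num_classes), key=lambda k: prob_maps[k][i])
        hypotheses[best].append(point)
    primitives = []
    for k, pts in enumerate(hypotheses):               # detect primitives from each set
        if k in skip_classes or not pts:
            continue
        primitives.extend(fit_class(k, pts))
    return primitives, hypotheses

def dummy_fitter(k, pts):
    """Placeholder fitter: reports one primitive per class with its support size."""
    return [{"class": k, "num_points": len(pts)}]

prob_maps = [[0.9, 0.2, 0.1], [0.1, 0.7, 0.3], [0.0, 0.1, 0.6]]
points = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.2), (0.2, 0.0, 1.1)]
primitives, hypotheses = primitive_fitting(prob_maps, points, dummy_fitter,
                                           skip_classes=(2,))
print(primitives)
```

Skipping the third class in the usage example mimics how boundary points can be withheld from RANSAC sampling.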

V-B Primitive Fitting Evaluation

It is non-trivial to design a proper set of evaluation criteria for primitive detection and fitting accuracy, and we are not aware of any existing work or dataset that does so. It is difficult to comprehensively evaluate and thus compare different primitive fitting methods, partly because 1) as mentioned previously, due to occlusion, a single instance is commonly fitted as multiple primitives, each of which may be close enough to the ground truth instance; and 2) such over-detection might also be caused by improper inlier thresholds on noisy data.

Pixel-wise average precision (AP) and AP of instances matched at various levels (50% to 90%) of point-wise intersection-over-union (IoU) are used for evaluating image-based instance segmentation problems [36]. However, this typical IoU range is inappropriate for our problem. Requiring more than 50% IoU means that at most one fitted primitive can be matched to each true instance. Since we do not need more than 50% of the true points to fit a reasonable primitive representing the true one, this range is overly strict and might falsely reject many good fits: either more than 50% of the true points are taken by other incorrect fits, or during observation the true instance is occluded and split into pieces, each containing less than 50% of the true points (see Figure 5 for more examples). After all, a large IoU is not necessary for good primitive fitting.

Thus, the IoU is replaced by intersection-over-true (IoT) in this problem. It is the number of true inliers of a predicted primitive divided by the total number of points in the true instance. A predicted primitive and a true instance are matched iff 1) IoT > 30%, and 2) the predicted primitive has the same class as the true instance. This implies that one instance can have at most 3 matched predictions.
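The matching rule can be made concrete with a short sketch that represents each primitive by a class name and a set of point indices. The data layout and function names are assumptions made for the example; only the two matching conditions come from the text.

```python
def intersection_over_true(pred_points, true_points):
    """Fraction of the true instance's points that the prediction covers."""
    return len(pred_points & true_points) / len(true_points) if true_points else 0.0

def intersection_over_union(pred_points, true_points):
    """Standard IoU, shown only for comparison with IoT."""
    union = pred_points | true_points
    return len(pred_points & true_points) / len(union) if union else 0.0

def is_match(pred, true, min_iot=0.3):
    """A prediction matches a true instance iff classes agree and IoT > min_iot."""
    return (pred["cls"] == true["cls"]
            and intersection_over_true(pred["points"], true["points"]) > min_iot)

# One true cylinder with 10 points, seen as two occluded pieces plus distractors.
true_cyl = {"cls": "cylinder", "points": set(range(10))}
preds = [
    {"cls": "cylinder", "points": {0, 1, 2, 3}},          # piece covering 40%
    {"cls": "cylinder", "points": {4, 5, 6, 7, 20, 21}},  # piece covering 40%
    {"cls": "cylinder", "points": {8, 9}},                # too small: 20%
    {"cls": "plane", "points": set(range(10))},           # wrong class
]
matched = [is_match(p, true_cyl) for p in preds]
ious = [intersection_over_union(p["points"], true_cyl["points"]) for p in preds]
print(matched, ious)
```

Both 40% pieces are accepted under IoT, whereas neither would pass a 50% IoU threshold, which is the over-strictness argued against above.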

Based on the above matching criteria, a matched instance (if it exists) can be identified for each predicted primitive. Conversely, each true instance may have several matching prediction candidates. To eliminate the ambiguity, the candidate with the smallest fitting error is selected as the best match. To be fair and consistent, the fitting error is defined as the mean distance to a primitive, obtained by projecting all of the points in the true instance onto the predicted primitive. After the matches are found, primitive average precision (PAP) and primitive average recall (PAR) are used to quantify the primitive detection quality.


$$\mathrm{PAP} = \frac{N_{p2t}}{N_p}, \qquad \mathrm{PAR} = \frac{N_{t2p}}{N_t},$$

where $N_{p2t}$ is the number of predictions having a matched true instance, $N_p$ the total number of predicted primitives, $N_{t2p}$ the number of true instances with a best prediction, and $N_t$ the total number of true instances, all counted over the whole test set.
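A minimal sketch of how these two ratios could be computed from a list of accepted matches is given below. It assumes that matching has already been done and that each match carries a fitting error; the tuple layout and the function name are hypothetical.

```python
def primitive_precision_recall(matches, num_predictions, num_true):
    """Compute PAP and PAR from accepted (pred_id, true_id, fit_error) matches.

    PAP = predictions with a matched true instance / all predictions
    PAR = true instances with a best prediction   / all true instances
    The best prediction of a true instance is the match with the smallest error.
    """
    matched_preds = {pred_id for pred_id, _, _ in matches}
    best = {}
    for pred_id, true_id, err in matches:
        if true_id not in best or err < best[true_id][1]:
            best[true_id] = (pred_id, err)
    pap = len(matched_preds) / num_predictions if num_predictions else 0.0
    par = len(best) / num_true if num_true else 0.0
    return pap, par, {t: p for t, (p, _) in best.items()}

# Four predictions, three true instances; predictions 0 and 1 both match instance 0.
matches = [(0, 0, 0.8), (1, 0, 0.3), (2, 1, 0.5)]
pap, par, best_match = primitive_precision_recall(matches, num_predictions=4, num_true=3)
print(pap, par, best_match)
```

In the usage example, three of four predictions are matched, but only two of three true instances are recovered, so precision and recall differ.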

VI Experiments and Discussion

VI-A Geometric Segmentation Experiments

Sub-columns: BND = boundary, PLN = plane, SPH = sphere, CYL = cylinder, CONE = cone, AVE = average. A dash marks a value that is not reported.

Network   | Precision                           | Recall                              | IoU                                 | F1                                  | Accuracy
          | BND   PLN   SPH   CYL   CONE  AVE   | BND   PLN   SPH   CYL   CONE  AVE   | BND   PLN   SPH   CYL   CONE  AVE   | BND   PLN   SPH   CYL   CONE  AVE   |
N+BO      | 0.944   -     -     -     -     -   | 0.820   -     -     -     -     -   | 0.781   -     -     -     -     -   | 0.877   -     -     -     -     -   | 0.964
P         |   -   0.915 0.811 0.867 0.642 0.809 |   -   0.971 0.620 0.715 0.664 0.743 |   -   0.891 0.599 0.655 0.488 0.658 |   -   0.939 0.664 0.762 0.611 0.744 | 0.871
N         |   -   0.979 0.915 0.934 0.727 0.889 |   -   0.988 0.884 0.788 0.829 0.872 |   -   0.968 0.860 0.752 0.633 0.803 |   -   0.983 0.894 0.826 0.734 0.859 | 0.924
PN        |   -   0.978 0.913 0.919 0.710 0.880 |   -   0.984 0.868 0.806 0.797 0.864 |   -   0.962 0.847 0.758 0.601 0.792 |   -   0.980 0.882 0.838 0.711 0.853 | 0.920
P+MB      |   -   0.929 0.818 0.888 0.658 0.823 |   -   0.967 0.656 0.730 0.706 0.765 |   -   0.900 0.626 0.677 0.518 0.680 |   -   0.945 0.690 0.774 0.638 0.762 | 0.881
N+MB      |   -   0.978 0.899 0.923 0.737 0.884 |   -   0.985 0.864 0.806 0.816 0.868 |   -   0.964 0.835 0.764 0.638 0.800 |   -   0.981 0.873 0.836 0.738 0.857 | 0.927
PN+MB     |   -   0.979 0.911 0.900 0.677 0.867 |   -   0.949 0.860 0.792 0.805 0.852 |   -   0.930 0.839 0.737 0.576 0.771 |   -   0.958 0.875 0.817 0.686 0.834 | 0.894
N+BAGS    | 0.868 0.963 0.908 0.926 0.756 0.888 | 0.849 0.976 0.874 0.833 0.821 0.871 | 0.752 0.941 0.848 0.790 0.654 0.797 | 0.858 0.969 0.884 0.859 0.755 0.865 | 0.918
N+MB+BAGS | 0.868 0.950 0.886 0.891 0.677 0.851 | 0.809 0.977 0.855 0.765 0.749 0.831 | 0.720 0.929 0.820 0.703 0.549 0.744 | 0.837 0.962 0.862 0.792 0.662 0.823 | 0.887
N5        |   -   0.980 0.917 0.940 0.744 0.895 |   -   0.979 0.877 0.809 0.808 0.868 |   -   0.960 0.854 0.776 0.642 0.808 |   -   0.979 0.889 0.844 0.741 0.863 | 0.940
N5+MB     |   -   0.978 0.911 0.920 0.725 0.884 |   -   0.977 0.862 0.804 0.793 0.859 |   -   0.956 0.841 0.760 0.614 0.793 |   -   0.977 0.878 0.834 0.719 0.852 | 0.932
N5+BAGS   | 0.847 0.966 0.906 0.932 0.728 0.883 | 0.804 0.970 0.873 0.808 0.812 0.853 | 0.702 0.939 0.845 0.769 0.630 0.777 | 0.825 0.968 0.883 0.842 0.732 0.850 | 0.921
TABLE I: Geometric segmentation evaluation. Red highlights the best along a column, while magenta for the top 3 best.

Fig. 4: BAGSFit (N5+BAGS) on real Kinect scans. Top: RGB image of the scanned scene. Middle: segmentation results. Bottom: fitted primitives (randomly colored) rendered together with real scans.

Network Short Names. To explore answers to the design questions raised in Section IV, we designed several CNNs, listed below with their short names and details:

  • P/N/PN. Basic networks, using position (P), normal (N), or both (PN) as input, trained with a multinomial loss function, outputting 4-channel mutually exclusive class probability maps (i.e., each pixel’s probabilities sum up to one). Background class points, i.e., the NURBS, are ignored for loss computation.

  • P/N/PN+MB. Same as the above basic networks except trained using the multi-binomial (MB) loss function as in equation (1), outputting 4-channel non-mutually-exclusive class probability maps (i.e., each pixel’s probabilities do not necessarily sum up to one, thus being multi-binomial classifiers).

  • N+BAGS. Network trained with normal input and BAGS labels (i.e., instance-aware boundary as an additional class jointly trained).

  • N+MB+BAGS. Same as N+BAGS except trained in a multi-binomial manner.

  • N5. Same as basic network N except treating the background class as an additional class involved in loss computation.

  • N5+MB. Same as N5 except trained in a multi-binomial manner.

  • N5+BAGS. Same as N+BAGS except treating the background class as an additional class (i.e., boundary and NURBS are two additional classes jointly trained).

  • N+BO. Same as N except only trained to detect boundary (i.e., a binary classifier).

Implementation Details. We implemented the geometric segmentation CNNs using Caffe [37] and DeepLabv2 [11]. Normals were estimated by PCA over a local window. We use meters as the unit for networks requiring position input. A pixel was marked as an instance-aware boundary if not all pixels in a local window around it belong to the same instance (or if the window contains invalid points). Input data was randomly cropped during training, while the full VGA resolution was used at test time. All of our networks were trained with the following hyper-parameters tuned on the validation set: 50 training epochs (i.e., 17280 iterations), batch size 10, learning rate 0.1 linearly decreasing to zero until the end of training, momentum 0.9, and weight decay 5e-4. The networks were trained and evaluated using several NVIDIA TITAN X GPUs, each with 12 GB memory, with a 2.5 FPS testing frame rate.
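The iteration count and the learning rate schedule stated above can be checked with a short calculation. The sketch below assumes a straight linear ramp from the base rate to zero evaluated per iteration; the function name is hypothetical and this is not the solver configuration itself.

```python
def linear_decay_lr(iteration, base_lr=0.1, max_iter=17280):
    """Learning rate that decreases linearly from base_lr to zero at max_iter."""
    return base_lr * max(0.0, 1.0 - iteration / max_iter)

# 3456 training scans, batch size 10, 50 epochs.
num_train_scans = 3456
batch_size = 10
epochs = 50
total_iterations = epochs * num_train_scans // batch_size

for it in (0, total_iterations // 2, total_iterations):
    print(it, linear_decay_lr(it, max_iter=total_iterations))
```

The computed total matches the 17280 iterations reported for 50 epochs.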

| Method | No. Primitives Fitted | No. Matched Instances | Primitive Average Precision (PAP) | Primitive Average Recall (PAR) | Fitting Error (cm) |
|---|---|---|---|---|---|
| ERANSAC | 4596 1001 2358 3123 11078 | 2017 542 942 879 4380 | 0.395 0.453 0.541 0.402 0.286 0.403 | 0.500 0.432 0.403 0.443 0.456 | 0.915 0.324 0.766 0.954 0.810 |
| P | 5360 621 2242 2037 10260 | 2448 591 1219 944 5202 | 0.507 0.470 0.952 0.549 0.468 0.516 | 0.607 0.471 0.521 0.476 0.541 | 0.936 0.248 0.931 0.519 0.759 |
| N | 4617 961 2789 2492 10859 | 2565 870 1456 1254 6145 | 0.566 0.571 0.905 0.532 0.507 0.576 | 0.636 0.693 0.623 0.633 0.640 | 0.903 0.403 1.229 0.657 0.866 |
| PN | 4537 888 3172 2133 10730 | 2522 859 1498 1197 6076 | 0.566 0.572 0.967 0.480 0.570 0.577 | 0.625 0.684 0.641 0.604 0.632 | 0.903 0.397 1.196 0.628 0.852 |
| P+MB | 5103 654 2201 2061 10019 | 2373 625 1249 1010 5257 | 0.525 0.479 0.956 0.573 0.493 0.534 | 0.588 0.498 0.534 0.510 0.547 | 0.892 0.283 0.996 0.549 0.767 |
| N+MB | 4479 931 2732 2654 10796 | 2528 857 1442 1222 6049 | 0.560 0.581 0.921 0.536 0.466 0.571 | 0.627 0.682 0.617 0.617 0.630 | 0.896 0.397 1.187 0.719 0.865 |
| PN+MB | 4236 951 3169 2305 10661 | 2427 856 1455 1168 5906 | 0.554 0.589 0.900 0.467 0.516 0.565 | 0.602 0.682 0.622 0.589 0.615 | 0.873 0.394 1.193 0.699 0.852 |
| N+BAGS | 3893 845 2299 2108 9145 | 2279 796 1453 1149 5677 | 0.621 0.594 0.942 0.637 0.548 0.626 | 0.565 0.634 0.621 0.580 0.591 | 0.765 0.363 1.144 0.587 0.768 |
| N+MB+BAGS | 3815 800 2599 1947 9161 | 2249 775 1356 1002 5382 | 0.587 0.598 0.969 0.528 0.518 0.594 | 0.558 0.617 0.580 0.506 0.560 | 0.754 0.357 1.149 0.533 0.753 |
| N5 | 3701 863 1874 1876 8314 | 2490 859 1458 1226 6033 | 0.726 0.693 0.995 0.793 0.663 0.740 | 0.617 0.684 0.624 0.619 0.628 | 0.841 0.395 1.163 0.617 0.815 |
| N5+MB | 3717 858 1920 1930 8425 | 2490 857 1479 1199 6025 | 0.715 0.689 0.999 0.783 0.634 0.729 | 0.617 0.682 0.633 0.605 0.627 | 0.842 0.398 1.148 0.659 0.821 |
| N5+BAGS | 3500 804 1765 1730 7799 | 2254 804 1397 1129 5584 | 0.716 0.654 1.000 0.796 0.658 0.723 | 0.559 0.640 0.598 0.570 0.581 | 0.742 0.367 1.096 0.555 0.740 |
| MaskRCNN | 2422 890 1771 3728 8811 | 1104 553 944 1017 3618 | 0.411 0.526 0.621 0.590 0.282 0.445 | 0.274 0.440 0.404 0.513 0.377 | 10.789 35.361 19.122 23.348 18.619 |
TABLE II: Primitive fitting evaluation. Red highlights the best along a column, while magenta highlights the top 3 best.

Fig. 5: BAGSFit (N5+BAGS) on simulated test scans. Top: Ground truth labels. Middle: segmentation results. Bottom: fitted primitives (randomly colored) rendered together with real scans.

Discussions. Evaluation results of all 12 networks on the test set of 720 simulated scans are summarized in Table I.

  1. Comparing the P/N/PN rows, we found that normal input performs best, and interestingly it outperforms the combination of normal and position. This may be caused by the difficulty of normalizing position data for network input.

  2. Comparing the P/N/PN rows with the P/N/PN+MB rows, we found that the classic multinomial loss mostly leads to better performance than the multi-binomial loss.

  3. Comparing N with N+BAGS, we found that adding boundary detection to the segmentation has only a very small negative influence on segmentation performance. This is appealing since a single network performs both segmentation and boundary detection. Further comparing N+BAGS with N+BO, we found that BAGS in fact increases boundary recall compared to N+BO, which only detects boundaries.

  4. Comparing N5 with N, we found no conclusive, significant performance change from ignoring the background class. This nevertheless suggests a benefit of jointly training the background class, as it enables the following steps to focus only on regions seemingly explainable by the predefined primitive library.

Just for reference, we also tried an SVM using neighboring normals or principal curvatures for this task, and the highest pixel-wise accuracy we obtained after extensive parameter tuning was only 66%.

MaskRCNN. We also investigated MaskRCNN [38] on this task based on a popular public implementation, since it is appealing to convert the multi-model problem directly into a single-instance one. In this experiment, the input is a normal map. The network was trained from scratch on the same datasets as BAGS for 100 (instead of 50) epochs with a base learning rate of 1e-3. With the IoU threshold set to 0.5, the mAP of the obtained model was 26.0% (36.6% with an IoU threshold of 0.3), which is much lower than its performance on regular RGB object detection tasks. One potential reason is that the shapes of primitives vary much more than those of natural objects, while the MaskRCNN network has to resize proposed regions of varying aspect ratios to a fixed size. This change of aspect ratio might degrade the CNN’s performance on a task of geometric nature. Given this performance, we did not further evaluate its fitting results.
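For intuition on why the reported mAP rises when the IoU threshold is lowered from 0.5 to 0.3, the following minimal sketch computes the intersection-over-union of two binary instance masks. The function name `mask_iou` and the toy masks are illustrative assumptions and are not taken from the MaskRCNN implementation used in the experiment.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: treat as no overlap
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: two 4x4 squares that overlap on half of their area.
pred = np.zeros((4, 8), dtype=bool)
gt = np.zeros((4, 8), dtype=bool)
pred[:, 0:4] = True
gt[:, 2:6] = True
iou = mask_iou(pred, gt)  # intersection 8 pixels, union 24 pixels
print(iou >= 0.5, iou >= 0.3)
```

In this toy case the IoU is 1/3, so the prediction counts as a match at a threshold of 0.3 but not at 0.5.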

Generalizing to Real Data. Even though we did not tune the simulated scanner’s noise model to match our real Kinect scanner, Figure 4 shows that the network trained with simulated scans generalizes quite well to real world data.

VI-B Primitive Fitting Experiments

For fitting primitives, we used the original efficient RANSAC implementation [9] both as our baseline method (short name ERANSAC) and for our geometric verification.

Experiment Details. We used the following parameters required in [9] for all primitive fitting experiments, tuned on the validation set in an effort to maximize ERANSAC performance: minimum number of supporting points per primitive 1000, maximum inlier distance 0.03 m, maximum inlier angle deviation 30 degrees (for counting consensus scores) and 45 degrees (for final inlier set expansion), and overlooking probability 1e-4. The simulated test set contains 4033 planes, 1256 spheres, 2338 cylinders, and 1982 cones, for a total of 9609 primitive instances.
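To illustrate how the distance and angle thresholds above act together, the following is a minimal sketch of an inlier test for a plane hypothesis. It is not the implementation of [9]; the function name, the plane parameterization n·x + d = 0, and the use of the absolute cosine (i.e., treating normals as unoriented) are assumptions made for this example.

```python
import numpy as np

def plane_inliers(points, normals, plane_n, plane_d,
                  max_dist=0.03, max_angle_deg=30.0):
    """Boolean inlier mask for a plane n.x + d = 0.

    points:  (N, 3) positions in meters.
    normals: (N, 3) unit normals estimated at the points.
    A point is an inlier when it lies within max_dist of the plane and
    its normal deviates from the plane normal by at most max_angle_deg.
    """
    scale = np.linalg.norm(plane_n)
    n = np.asarray(plane_n, dtype=float) / scale
    d = plane_d / scale
    dist = np.abs(points @ n + d)
    # Absolute value: normal orientation (sign) is assumed to be irrelevant.
    cos_dev = np.abs(normals @ n)
    return (dist <= max_dist) & (cos_dev >= np.cos(np.deg2rad(max_angle_deg)))

points = np.array([[0.0, 0.0, 0.01], [1.0, 0.0, 0.05],
                   [0.0, 1.0, -0.02], [1.0, 1.0, 0.0]])
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0],
                    [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]])
mask = plane_inliers(points, normals, np.array([0.0, 0.0, 1.0]), 0.0)
print(mask)
```

In the toy data, the second point fails the distance test and the fourth fails the angle test. A hypothesis would then be accepted only if `mask.sum()` reaches the minimum support, which is 1000 points in the setting above.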

Discussions. Using each network’s segmentation as input to Algorithm 1, the primitive fitting results were evaluated on the simulated test set and are summarized in Table II together with the ERANSAC baseline.

  1. ERANSAC performance is significantly lower than most variants of BAGSFit, in accordance with our qualitative evaluation.

  2. N5-related experiments receive the highest PAP scores, which is reasonable because recognizing and removing the background class greatly reduces the complexity of scenes.

  3. In terms of average fitting error, N+BAGS < N, N5+BAGS < N5, and N+MB+BAGS < N+MB, which strongly supports the benefit of BAGS as mentioned in section V-A.

  4. N5+BAGS gets the lowest fitting error, benefiting from both background and boundary removal before fitting.
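As a consistency check on Table II, the recall entries can be reproduced from the matched-instance counts and the ground-truth counts given in Experiment Details. The sketch below assumes that PAR is the number of matched instances divided by the number of ground-truth instances, and that the per-class values in the table follow the order plane, sphere, cylinder, cone; both are assumptions inferred from the numbers rather than definitions stated here.

```python
# Ground-truth instance counts of the simulated test set (Experiment Details).
GT = {"plane": 4033, "sphere": 1256, "cylinder": 2338, "cone": 1982}

# Counts copied from Table II for two methods; the class order
# (plane, sphere, cylinder, cone) is an assumption.
FITTED = {"ERANSAC": [4596, 1001, 2358, 3123], "N5+BAGS": [3500, 804, 1765, 1730]}
MATCHED = {"ERANSAC": [2017, 542, 942, 879], "N5+BAGS": [2254, 804, 1397, 1129]}

def recall_per_class(method):
    """Matched instances divided by ground-truth instances, per class."""
    return [round(m / g, 3) for m, g in zip(MATCHED[method], GT.values())]

def overall_recall(method):
    """Total matched instances divided by total ground-truth instances."""
    return round(sum(MATCHED[method]) / sum(GT.values()), 3)

def overall_matched_ratio(method):
    """Total matched instances divided by total fitted primitives."""
    return round(sum(MATCHED[method]) / sum(FITTED[method]), 3)

for method in MATCHED:
    print(method, recall_per_class(method),
          overall_recall(method), overall_matched_ratio(method))
# ERANSAC [0.5, 0.432, 0.403, 0.443] 0.456 0.395
# N5+BAGS [0.559, 0.64, 0.598, 0.57] 0.581 0.716
```

Under these assumptions the recall values agree with the PAR entries of Table II, and the ratio of total matched instances to total fitted primitives agrees with the first PAP entry of each row.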

More results. Figure 5 shows more testing results. Readers may refer to our supplementary video for more result visualizations.


We thank Yuichi Taguchi, Srikumar Ramalingam, Zhiding Yu, Teng-Yok Lee, Esra Cansizoglu, and Alan Sullivan for their helpful comments.


  • [1] I. Biederman, “Recognition-by-components: a theory of human image understanding.” Psychological review, vol. 94, no. 2, p. 115, 1987.
  • [2] A. Berner, J. Li, D. Holz, J. Stuckler, S. Behnke, and R. Klein, “Combining contour and shape primitives for object detection and pose estimation of prefabricated parts,” in Proc. IEEE Int’l Conf. Image Processing (ICIP).   IEEE, 2013, pp. 3326–3330.
  • [3] Y. Taguchi, Y.-D. Jian, S. Ramalingam, and C. Feng, “Point-plane slam for hand-held 3d sensors,” in Proc. IEEE Int’l Conf. Robotics and Automation (ICRA).   IEEE, 2013, pp. 5182–5189.
  • [4] L. Ma, C. Kerl, J. Stückler, and D. Cremers, “Cpa-slam: Consistent plane-model alignment for direct rgb-d slam,” in Proc. IEEE Int’l Conf. Robotics and Automation (ICRA).   IEEE, 2016, pp. 1285–1291.
  • [5] M. Dzitsiuk, J. Sturm, R. Maier, L. Ma, and D. Cremers, “De-noising, stabilizing and completing 3d reconstructions on-the-go using plane priors,” in Proc. IEEE Int’l Conf. Robotics and Automation (ICRA).   IEEE, 2017, pp. 3976–3983.
  • [6] Y. Li, X. Wu, Y. Chrysathou, A. Sharf, D. Cohen-Or, and N. J. Mitra, “Globfit: Consistently fitting primitives by discovering global relations,” in ACM Trans. Graphics, vol. 30, no. 4.   ACM, 2011, p. 52.
  • [7] P. Tang, D. Huber, B. Akinci, R. Lipman, and A. Lytle, “Automatic reconstruction of as-built building information models from laser-scanned point clouds: A review of related techniques,” Automation in construction, vol. 19, no. 7, pp. 829–843, 2010.
  • [8] J. Xiao and Y. Furukawa, “Reconstructing the world’s museums,” Int’l J. Computer Vision, vol. 110, no. 3, pp. 243–258, 2014.
  • [9] R. Schnabel, R. Wahl, and R. Klein, “Efficient ransac for point-cloud shape detection,” in Computer Graphics Forum, vol. 26, no. 2.   Wiley Online Library, 2007, pp. 214–226.
  • [10] J. T. Todd, “The visual perception of 3d shape,” Trends in cognitive sciences, vol. 8, no. 3, pp. 115–121, 2004.
  • [11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.
  • [12] Z. Yu, C. Feng, M. Y. Liu, and S. Ramalingam, “CASENet: Deep category-aware semantic edge detection,” in IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
  • [13] D. Holz, S. Holzer, R. B. Rusu, and S. Behnke, “Real-time plane segmentation using rgb-d cameras,” in Robot Soccer World Cup.   Springer, 2011, pp. 306–317.
  • [14] C. Feng, Y. Taguchi, and V. R. Kamat, “Fast plane extraction in organized point clouds using agglomerative hierarchical clustering,” in Proc. IEEE Int’l Conf. Robotics and Automation (ICRA).   IEEE, 2014, pp. 6218–6225.
  • [15] A. Abuzaina, M. S. Nixon, and J. N. Carter, “Sphere detection in kinect point clouds via the 3d hough transform,” in Int’l Conf. Computer Analysis of Images and Patterns.   Springer, 2013, pp. 290–297.
  • [16] T.-T. Tran, V.-T. Cao, and D. Laurendeau, “esphere: extracting spheres from unorganized point clouds,” The Visual Computer, vol. 32, no. 10, pp. 1205–1222, 2016.
  • [17] A. Leonardis, A. Gupta, and R. Bajcsy, “Segmentation of range images as the search for the best description of the scene in terms of geometric primitives,” 1990.
  • [18] Z. Toony, D. Laurendeau, and C. Gagné, “Describing 3d geometric primitives using the gaussian sphere and the gaussian accumulator,” 3D Research, vol. 6, no. 4, p. 42, 2015.
  • [19] K. Georgiev, M. Al-Hami, and R. Lakaemper, “Real-time 3d scene description using spheres, cones and cylinders,” arXiv preprint arXiv:1603.03856, 2016.
  • [20] W. Martens, Y. Poffet, P. R. Soria, R. Fitch, and S. Sukkarieh, “Geometric priors for gaussian process implicit surfaces,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 373–380, 2017.
  • [21] O. Chum and J. Matas, “Matching with prosac - progressive sample consensus,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005, pp. 220–226.
  • [22] B. J. Tordoff and D. W. Murray, “Guided-mlesac: faster image transform estimation by using matching priors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1523–1535, 2005.
  • [23] T. Sattler, B. Leibe, and L. Kobbelt, “Scramsac: Improving ransac’s efficiency with a spatial consistency filter,” in Proc. IEEE Int’l Conf. Computer Vision (ICCV), 2009, pp. 2090–2097.
  • [24] K. Ni, H. Jin, and F. Dellaert, “Groupsac: Efficient consensus in the presence of groupings,” in Proc. IEEE Int’l Conf. Computer Vision (ICCV), 2009, pp. 2193–2200.
  • [25] L. Magri and A. Fusiello, “Multiple model fitting as a set coverage problem,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3318–3326.
  • [26] R. Toldo and A. Fusiello, “Robust multiple structures estimation with j-linkage,” Proc. European Conf. Computer Vision (ECCV), pp. 537–547, 2008.
  • [27] T.-J. Chin, J. Yu, and D. Suter, “Accelerated hypothesis generation for multistructure data via preference analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 625–638, 2012.
  • [28] T. T. Pham, T.-J. Chin, J. Yu, and D. Suter, “The random cluster model for robust geometric fitting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1658–1671, 2014.
  • [29] L. Magri and A. Fusiello, “Robust multiple model fitting with preference analysis and low-rank approximation.” in Proc. British Machine Vision Conf. (BMVC), vol. 20, no. 1-20, 2015, p. 12.
  • [30] H. Isack and Y. Boykov, “Energy-based geometric multi-model fitting,” Int’l J. Computer Vision, vol. 97, no. 2, pp. 123–147, 2012.
  • [31] O. J. Woodford, M.-T. Pham, A. Maki, R. Gherardi, F. Perbet, and B. Stenger, “Contraction moves for geometric model fitting,” in Proc. European Conf. Computer Vision (ECCV).   Springer, 2012, pp. 181–194.
  • [32] D. Barath and J. Matas, “Multi-class model fitting by energy minimization and mode-seeking,” arXiv preprint arXiv:1706.00827, 2017.
  • [33] P. Amayo, P. Pinies, L. M. Paz, and P. Newman, “Geometric multi-model fitting with a convex relaxation algorithm,” arXiv preprint arXiv:1706.01553, 2017.
  • [34] M. Gschwandtner, R. Kwitt, A. Uhl, and W. Pree, “Blensor: blender sensor simulation toolbox,” Advances in visual computing, pp. 199–208, 2011.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2016.
  • [36] M. Bai and R. Urtasun, “Deep watershed transform for instance segmentation,” arXiv preprint arXiv:1611.08303, 2016.
  • [37] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM Multimedia, 2014.
  • [38] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proc. IEEE Int’l Conf. Computer Vision (ICCV), 2017, pp. 2980–2988.