Unsupervised Learning for Cuboid Shape Abstraction via Joint Segmentation from Point Clouds

06/07/2021 ∙ by Kaizhi Yang, et al. ∙ USTC

Representing complex 3D objects as simple geometric primitives, known as shape abstraction, is important for geometric modeling, structural analysis, and shape synthesis. In this paper, we propose an unsupervised shape abstraction method to map a point cloud into a compact cuboid representation. We jointly predict cuboid allocation as part segmentation and cuboid shapes and enforce the consistency between the segmentation and shape abstraction for self-learning. For the cuboid abstraction task, we transform the input point cloud into a set of parametric cuboids using a variational auto-encoder network. The segmentation network allocates each point into a cuboid considering the point-cuboid affinity. Without manual annotations of parts in point clouds, we design four novel losses to jointly supervise the two branches in terms of geometric similarity and cuboid compactness. We evaluate our method on multiple shape collections and demonstrate its superiority over existing shape abstraction methods. Moreover, based on our network architecture and learned representations, our approach supports various applications including structured shape generation, shape interpolation, and structural shape clustering.




1. Introduction

3D objects, especially manufactured objects, often have regularity in their external geometry. For example, chairs typically contain regular parts such as the seat and back to achieve their functionality of stable placement. These design rules are reflected in the object structure, including specific part shapes and the relationships among parts. Understanding objects from the structural perspective makes it easier to understand "why they are designed this way" or "why they look the way they do". While early vision tasks tend to understand an object from a holistic geometric perspective, some recent works attempt to incorporate more structural information into 3D shape understanding or modeling tasks (Li et al., 2017; Mo et al., 2019a; Yang et al., 2020). Although these structure-based approaches have shown significant advantages, object structure, which is often hidden behind the geometry, is not easy to acquire. Because humans can easily recognize the structure of objects, many researchers have used manual annotations to attach structural information to object shapes (Yi et al., 2016; Mo et al., 2019b). However, the automatic acquisition of structural representations for common objects remains a challenging task.

In this paper, we focus on unsupervised shape abstraction, which aims to parse objects into concise structural representations. Since the geometry of each part is much simpler than that of the whole, shape abstraction attempts to assemble an object from simple geometric primitives. Previous works often employ volumetric data (Tulsiani et al., 2017), watertight meshes (Sun et al., 2019), or signed distance fields (Smirnov et al., 2020) as supervision, since spatial occupancy information provides important guidance for the shape abstraction task. However, these data are not straightforward to acquire and require extra geometric processing.

On the other hand, point clouds are closer to raw scan data, which is much easier to acquire with LiDAR or RGB-D cameras. However, little work has been done on learning primitive representations from point clouds alone, since discrete point clouds lack dense occupancy supervision. The sparsity of point clouds easily causes degeneration in the structural representation. Fig. 2 shows a 2D example. Consider a point cloud sampled from the surface of a rectangle: both cuboid representations in the figure match its surface geometry, because the points can be divided into arbitrary groups and each group fitted with a separate cuboid. The abstraction is therefore ambiguous. In this case, the shape abstraction task degenerates into a surface-fitting task with planes, and each primitive loses its structural meaning.

Figure 2. Ambiguity and degradation problems in the shape abstraction task for point clouds. The cuboid abstraction results on both sides well fit the point cloud in geometry. Compared to the left side which abstracts the point cloud as one single cuboid, the representation at the right side degenerates to surface fitting with a set of planes and ignores its structure.

Compared to shape abstraction which focuses more on geometry fitting, the co-segmentation task of point clouds is biased towards extracting the common structure of the whole dataset in order to prevent ambiguity and degeneration. Our main idea is to learn a general rule for assigning points to different cuboid primitives consistently among shapes, rather than directly based on the geometry proximity between the points and cuboids. Moreover, for the joint tasks of segmentation and structural abstraction without structural annotations, we design several easy-to-implement loss functions that do not need any sampling operation on parametric surfaces.

Under the unsupervised learning framework, we propose a shape abstraction network based on a variational autoencoder (VAE) to embed the input point cloud into a latent code and decode it into a set of parameterized cuboids. In addition, we make full use of the high-level feature embedding to exploit the point-cuboid correlation for the segmentation task. Benefiting from the joint segmentation branch, our algorithm is able to extract finer object structures than previous methods, even though it uses less supervisory information, and to obtain consistent part segmentation among instances with diverse shapes in the same category. Based on the latent space embedding, our method not only supports abstract shape reconstruction from point clouds but also supports shape interpolation and virtual shape synthesis by sampling the latent shape code.

In summary, this paper makes the following key contributions:

  • We propose an unsupervised framework that entangles part segmentation and cuboid-based shape abstraction for point clouds.

  • A set of novel loss functions are designed to jointly supervise the two tasks without requiring manual annotations on parts or structural relationships.

  • In addition to shape abstraction and segmentation, our framework supports a range of applications, including shape generation, shape interpolation, and structural shape clustering.

2. Related Work

Manufactured objects always exhibit strong structural properties. Understanding the high-order structural representation of shapes has been a hot research topic in the field of geometric analysis. The main difficulty of this task is that structural properties are embedded within the geometric shape. Moreover, structure and geometry are intertwined and affect each other, making it challenging to decouple part structures through geometric shapes. In this section, we discuss the most relevant works on both supervised and unsupervised learning for object structures and shape co-segmentation.

Supervised learning for object structure

Some researchers have focused on supervised learning of object structures from large-scale datasets with manual structural annotations (Yi et al., 2016; Mo et al., 2019b) or with parts decomposed by traditional optimization algorithms (Zou et al., 2017). These approaches can be divided into three categories according to how object parts are organized. Sequential approaches (Zou et al., 2017; Wu et al., 2020) employ recurrent neural networks to encode and decode parts one after another. In parallel approaches, parts are organized in a flat way (Wang et al., 2018; Gao et al., 2019; Wu et al., 2019; Schor et al., 2019; Dubrovina et al., 2019; Li et al., 2020; Gadelha et al., 2020). Tree-based approaches have attracted a lot of attention recently. Different levels in the tree structure represent different granularities of parts, and parent-child node pairs represent the inclusion relationship between parts. Most tree-based methods require ground-truth relationships between parts, such as adjacency and symmetry, to construct the tree. A generative recursive neural network (RvNN) is designed to generate shape structures and is trained in multiple phases for encoding shapes, learning shape manifolds, and synthesizing geometric details, respectively (Li et al., 2017). Niu et al. (2018) employ an RvNN to decode a global image feature recursively in a binary tree organization. A similar binary tree structure is also used for point cloud segmentation to utilize structural constraints across different levels (Yu et al., 2019). In order to encode more elaborate structural relationships between parts, StructureNet is introduced to integrate part-level connectivity and inter-part relationships hierarchically in an n-ary graph (Mo et al., 2019a). Yang et al. (2020) design a two-branch recursive neural network to represent geometry and structure in 3D shapes explicitly for shape synthesis in a controllable manner.

Unsupervised structural modeling

On the other hand, many approaches use unsupervised learning for the structure parsing task, assuming that the object geometry naturally reveals structural properties. One direction is unsupervised shape abstraction, which assembles 3D shapes from geometric primitives while preserving consistent structures in a shape collection. Tulsiani et al. (2017) make the first attempt to apply neural networks to abstracting 3D objects with cuboids without part annotations. A coverage loss and a consistency loss are developed to encourage mutual inclusion of target objects and predicted shapes. Sun et al. (2019) propose an adaptive hierarchical cuboid representation for more compact and expressive shape abstraction. They construct multiple levels for cuboid generation in different granularities and use a cuboid selection module to obtain the optimal abstraction. Besides cuboids, representing 3D shapes with more types of parametric primitives has been studied recently. Smirnov et al. (2020) define a general Chamfer distance in an Eulerian version based on distance fields, which allows abstraction with multiple parametric primitives. Superquadric representation is another option for enhancing geometric expressiveness and has been shown to be easier to learn than the cuboid representation for curved shapes (Paschalidou et al., 2019). Paschalidou et al. use only point clouds as supervision, constructing a bidirectional Chamfer distance between the predicted primitives and the point cloud. However, this requires differentiable uniform sampling over the primitives, which is not easy. In this paper, we design a new single-direction point-to-cuboid reconstruction loss that avoids sampling points on the cuboid surfaces.

Unsupervised segmentation

Some unsupervised object segmentation works can also be used to analyze object structures. Chen et al. (2019) treat co-segmentation as a shape representation learning problem. They learn multiple implicit field branches for representing individual parts of an object and preserving consistency of the segmented parts over the entire dataset. Aiming to learn the dense correspondence between 3D shapes in an unsupervised manner, a novel implicit function is proposed to measure the correspondence degree between points in different shapes and to obtain co-segmentation results (Liu and Liu, 2020). Lin et al. (2020) develop an efficient approach based on the medial axis transform to identify junctions between parts of a 3D shape for segmentation. However, this method does not consider the relationship among shapes like the common structure. In comparison, we jointly learn shape abstraction and segmentation in an unsupervised manner while preserving the structural consistency over different instances on the segmentation task.

Figure 3. Overview of our joint shape abstraction and segmentation network. To connect the shape abstraction task and point segmentation task, our approach consists of three parts: the feature embedding network, shape abstraction network, and cuboid-associated segmentation network. As a result, our network obtains the cuboid-based structural representation and the point cloud segmentation results simultaneously.

3. Our Approach

Given a point cloud that contains 3D points of an object, our goal is to reconstruct a set of cuboids to concisely represent the 3D shape of the object. Similar to previous methods (Tulsiani et al., 2017; Sun et al., 2019), each cuboid is parameterized by three vectors: a 3D translation, a 4D quaternion representing the 3D rotation, and a 3D scale.

For different objects in various shapes of the same category, we attempt to predict a fixed number of cuboids. The fixed order of cuboids naturally conveys the part correspondence in different instances of the same category. However, even within the same category of objects, the structure of each instance varies a lot. For example, some chairs have armrests while others do not. To fit different structures, we add an existence indicator to each cuboid that signals whether the cuboid appears in an instance. In summary, we parameterize each cuboid as an 11D vector (3 values for translation, 4 for rotation, 3 for scale, and 1 for existence), which includes the geometric properties and existence of the cuboid.
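The parameterization above can be illustrated with a small sketch in plain Python. The ordering of the values inside the vector, the (w, x, y, z) quaternion convention, and the function names `unpack_cuboid` and `quat_to_matrix` are assumptions made for illustration; the text only fixes the sizes of the individual parts.

```python
import math

def unpack_cuboid(vec):
    """Split an 11-value cuboid vector into its parts (layout is an assumption)."""
    if len(vec) != 11:
        raise ValueError("expected 11 values per cuboid")
    translation = list(vec[0:3])
    quat = list(vec[3:7])            # assumed order: (w, x, y, z)
    scale = list(vec[7:10])
    existence = vec[10]
    # A quaternion only encodes a rotation when it has unit length.
    norm = math.sqrt(sum(q * q for q in quat))
    quat = [q / norm for q in quat]
    return translation, quat, scale, existence

def quat_to_matrix(q):
    """3x3 rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
```

Normalizing the quaternion inside `unpack_cuboid` is one common way to keep a raw network output valid as a rotation; whether the original network does this is not stated in the text.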

We employ a variational auto-encoder (VAE) (Kingma and Welling, 2014) framework that embeds the input point cloud into a latent code and decodes the parameters of the cuboids using a decoder, as shown in the feature embedding network and shape abstraction network in Fig. 3. Meanwhile, we design a segmentation network that uses an attention scheme to build point-cuboid correlations and allocate points to cuboids. To train the encoder-decoder network without manual annotations on part-based structures, we jointly learn the shape abstraction branch and the segmentation branch by enforcing consistency between the part segmentation and the reconstructed cuboids through a set of specially designed losses.

3.1. Feature embedding network

The feature embedding network maps the input point cloud into a latent code that follows a Gaussian distribution. We first extract point-wise features with an encoder that contains two EdgeConv layers (Wang et al., 2019), where each layer yields one feature vector per point. By concatenating these two features, the point feature is obtained. Then we extract a global feature by feeding the point features into a fully-connected layer with max-pooling (Qi et al., 2017).

As a generative model, we map the global feature to a latent space. We design two branches composed of fully connected layers to predict the mean $\mu$ and standard deviation $\sigma$ of the Gaussian distribution of the latent variable, respectively. Then the latent code $z$ is obtained by re-parameterizing a random noise $\epsilon$ that follows a standard normal distribution:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad (1)$$

where $\odot$ represents element-wise multiplication.
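As a minimal sketch of the re-parameterization step, the snippet below draws a latent code from a predicted mean and standard deviation using only the standard library. The function name `reparameterize` and the list-based representation are illustrative assumptions; a real implementation would operate on framework tensors so that gradients flow through the mean and standard deviation.

```python
import random

def reparameterize(mu, sigma, rng=random):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]
```

Because the randomness is isolated in the noise term, the sample is a deterministic function of the mean and standard deviation once the noise is fixed, which is what makes the sampling step trainable.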

3.2. Shape abstraction network

The shape abstraction sub-network decodes the latent code into the parameters of the cuboids. In order to retain the high-level structure information with part correspondence, we first infer cuboid-related features from the latent code using one sub-branch per cuboid. In each branch, we embed a one-hot vector, whose entry at the index of that cuboid is 1 and whose other entries are 0, with a cuboid code encoder to obtain an embedding vector for the cuboid. We concatenate the cuboid embedding vector with the latent code and feed the result into a cuboid feature encoder to obtain the cuboid feature. With the fixed order of cuboids and the one-hot cuboid codes, part correspondences are preserved implicitly in the network, so that the decoder not only captures the geometric shape of the corresponding cuboids but also embeds the structure of a specific part in the object.

Then, each cuboid feature passes through a cuboid parameter prediction module to estimate the geometric parameters and existence probability of each cuboid. Note that the two feature encoders (the cuboid code encoder and the cuboid feature encoder) and the cuboid decoder in each cuboid branch are all composed of fully connected layers and share parameters among all cuboids.

3.3. Cuboid-associated segmentation network

The segmentation branch allocates each point in the input point cloud to one of the cuboids. In other words, we perform point cloud segmentation with one label per cuboid, where each cuboid is a potential part under the common structure of an object category. We use two fully-connected layers to reduce the dimensions of the point features and the cuboid features, respectively. Treating the reduced point features and cuboid features as the query and key in an attention scheme, we compute the affinity matrix between points and cuboids (Eq. 2). Then a softmax operation is performed on each row to obtain, for every point, the probability distribution of belonging to the part that each cuboid represents. Thus, we get a point-to-cuboid allocation probability matrix (Eq. 3). From this probabilistic allocation matrix, we can simply obtain the segmentation result of a point cloud using argmax.
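The allocation step can be sketched as follows. This is a simplified illustration that assumes an unscaled dot product between point and cuboid features as the affinity; the exact form of the affinity in the original is not recoverable from the text, and the names `allocation_probabilities` and `segment` are hypothetical.

```python
import math

def allocation_probabilities(point_feats, cuboid_feats):
    """Row-wise softmax over dot-product affinities (one row per point)."""
    probs = []
    for q in point_feats:
        scores = [sum(a * b for a, b in zip(q, k)) for k in cuboid_feats]
        top = max(scores)                          # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def segment(probs):
    """Hard label per point: index of the most probable cuboid."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]
```

Each row of the result sums to one, so it can be read as a soft assignment of one point over all cuboids, while `segment` gives the hard segmentation used at inference time.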

3.4. Loss functions

In order to train our network without ground-truth segmentation or cuboid shape representations, we design several novel losses between the results of the segmentation and abstraction branches to enforce the geometric coherence and structure compactness. More specifically, we design a reconstruction loss between the segmentation and cuboid abstraction. A compactness loss is designed to encourage the network to learn a more compact structural representation based on the point-cuboid allocation in the segmentation branch. The cuboid existence loss is designed to predict the existence indicator for each cuboid. To enable the capability of shape generation, a KL-divergence loss is designed to enforce the latent code to follow a standard Gaussian distribution.

3.4.1. Reconstruction loss

While no ground-truth cuboid annotations can be obtained in our unsupervised framework, the segmentation branch is utilized to provide a probabilistic part assignment for the input points. The reconstruction loss is expected not only to minimize the local geometric distance but also to encourage consistent high-order part allocations. We calculate the distance between each point of the input point cloud and each predicted cuboid and sum these distances, weighted by the probabilistic assignment predicted by the segmentation network, as the shape reconstruction loss:


This loss function tends to reduce the geometric deviation between a point and the cuboids to which it is assigned with high weight. In other words, this loss measures the compatibility of the segmentation branch with the abstraction branch. The shape parameters of a particular cuboid are optimized according to the weighted assignment, while the point-cuboid allocation probability is adjusted according to the geometric shape of the cuboids. Through this loss, we jointly optimize the cuboid parameters and the co-segmentation map.

Note that Eq. (4) is actually a weighted single-direction distance from the point cloud to the cuboids. Compared with the bidirectional Chamfer distance, it does not require a differentiable sampling operation on the cuboid surface. Moreover, compared with training with the Chamfer distance, where points are assigned to cuboids based on the distance between them, this formulation allows our model to learn the cuboid assignment explicitly and jointly, making it less likely to get stuck in a local minimum of the geometric optimization.
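A weighted single-direction loss of this kind can be sketched as below. The inputs are a matrix of allocation probabilities and a matrix of point-to-cuboid distances, both indexed as [point][cuboid]. Averaging over points is an assumption, since the normalization of the original loss is not recoverable from the text, and the name `reconstruction_loss` is hypothetical.

```python
def reconstruction_loss(weights, distances):
    """Average over points of sum_m weights[n][m] * distances[n][m]."""
    total = 0.0
    for w_row, d_row in zip(weights, distances):
        total += sum(w * d for w, d in zip(w_row, d_row))
    return total / len(weights)
```

For example, with distances `[[0.2, 5.0], [1.0, 3.0]]`, moving the second point's weight from an even split to the nearer cuboid lowers the loss, which is the pressure that couples the segmentation branch to the cuboid geometry.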

However, single-direction distances in general lead to model degeneration. We introduce normal information into the reconstruction loss to prevent this degradation. Instead of calculating the distance from a point to the closest cuboid face, we calculate the distance from the point to the cuboid face whose normal direction is most similar to the normal of the point. In addition, in order to emphasize the normal similarity and enhance the robustness to noise in point clouds, we introduce an additional sampling strategy when computing the distance. We sample a new point along the normal direction of the input point, at a random distance drawn from a Gaussian distribution. For the sampled point, we look for its nearest point on the selected cuboid face and define the distance from the input point to the cuboid as


Fig. 4 illustrates the distance definition considering both normal similarity and point sampling along the normal direction. Under this definition, the distance is zero only when the point lies on the surface of the cuboid with the same normal. Unless specifically mentioned, the parameter of the sampling Gaussian is set to 0.05 in our experiments.
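One possible reading of this distance is sketched below for a cuboid expressed in its own local frame (centred at the origin, axis-aligned, with given half-sizes). Several points are assumptions rather than facts from the text: the distance is measured from the original point to the nearest point found for the sampled point (a reading chosen because it is zero exactly when the point lies on a face with a matching normal), the value 0.05 is interpreted as the standard deviation of the Gaussian, and the name `point_to_cuboid_distance` is hypothetical.

```python
import math
import random

def point_to_cuboid_distance(p, n, half_size, sigma=0.05, rng=random):
    """Distance from point p (unit normal n) to an origin-centred, axis-aligned cuboid."""
    # 1. Pick the face whose outward normal is most similar to n.
    axis = max(range(3), key=lambda i: abs(n[i]))
    sign = 1.0 if n[axis] >= 0 else -1.0
    # 2. Sample a point along the normal at a random signed offset.
    tau = rng.gauss(0.0, sigma)
    moved = [p[i] + tau * n[i] for i in range(3)]
    # 3. Nearest point on that face: clamp inside the face, snap to its plane.
    q = [min(max(moved[i], -half_size[i]), half_size[i]) for i in range(3)]
    q[axis] = sign * half_size[axis]
    # 4. Measure from the original point (assumed reading of the definition).
    return math.sqrt(sum((p[i] - q[i]) ** 2 for i in range(3)))
```

In this sketch, a point on the top face with an upward normal gets distance zero, while the same point with a sideways normal is measured against a side face and gets a positive distance. This illustrates how normals discourage the degenerate solution of covering a surface with unrelated thin cuboids.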

Figure 4. Illustration of our point-to-cuboid distance for calculating the reconstruction loss. For a point in the input point cloud, we sample a point along the normal direction and look for its closest point on the cuboid surface which has the most similar normal with . The Euclidean distance between and is defined as .

3.4.2. Cuboid compactness loss

Typically, there exist multiple combinations of cuboids to represent an object. More cuboids tend to result in more accurate shape approximation. However, the structure of an object is expected to be concise and clear. Thus, a small number of cuboids is preferred for shape abstraction. We design a cuboid compactness loss to penalize a large number of cuboids.

In the semantic segmentation task, when a category label does not appear in the segmentation result, one can consider that no object of that category is present. Therefore, we impose a sparsity loss on the segmentation result to reduce the number of cuboids used. From our point-cuboid allocation probability matrix, we can compute the portion of points likely to be allocated to each cuboid. Analogously, if no points are likely to be assigned to a certain cuboid, i.e., its portion is zero, we regard the structure represented by that cuboid as absent from the 3D shape. Therefore, we compute the compactness of the shape abstraction directly from the portion of each cuboid. Although the L1 loss is typically used to achieve sparseness, in our case the allocation probabilities defined in Eq. (3) are normalized by the softmax, so the portions always sum to one and optimizing the L1 loss produces no update, as illustrated in Fig. 5. Instead, we adopt a different norm to compute the compactness loss:


The small constant is added to prevent gradient explosion when a cuboid has no points allocated during training.
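The behaviour described here can be illustrated with a short sketch. The exponent of the norm and the value of the small constant are not recoverable from the text, so the code assumes a square-root (L1/2-style) penalty and a constant of 0.01 purely for illustration; `cuboid_portions` and `compactness_loss` are hypothetical names.

```python
import math

def cuboid_portions(weights):
    """Column averages of the allocation matrix: share of points per cuboid."""
    n = len(weights)
    return [sum(row[m] for row in weights) / n for m in range(len(weights[0]))]

def compactness_loss(portions, eps=0.01):
    """(sum_m sqrt(w_m + eps))^2, assuming a square-root style sparsity penalty."""
    return sum(math.sqrt(w + eps) for w in portions) ** 2
```

With two cuboids, the portions `[0.5, 0.5]` and `[1.0, 0.0]` have the same L1 norm, so an L1 penalty cannot distinguish them, whereas the square-root penalty is smaller for the sparse allocation. The derivative of the square root grows without bound near zero, which is why a small constant is needed when a cuboid receives no points.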

Figure 5. The illustration of , and in the case of two cuboids, i.e. (black line). The gradient of loss (red dotted line) will lead to after optimization. The gradient of loss is vertical to the black line, making no update of and with the constraint . Minimizing leads to or , i.e. the desired sparse solution.

3.4.3. Cuboid existence loss

As mentioned in the parametric representation of the cuboids, we predict the existence of a certain cuboid so that we can allow a varied number of cuboids to represent the 3D shape for different instances. On the other hand, the point-cuboid segmentation branch can naturally handle a varying number of cuboids. Therefore, we consider the portion of points allocated to each cuboid after segmentation as its existence ground truth. We set a threshold so that when the number of points allocated to a cuboid by the allocation matrix , we consider the cuboid existence , otherwise . In our experiments, we set .

We use binary cross-entropy between the predicted existence indicator for the cuboid in the shape abstraction sub-network and the approximated ground truth as the cuboid existence loss:
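A loss of this kind can be sketched as follows. The details are assumptions for illustration: points are counted per cuboid via the argmax of the allocation matrix, a cuboid counts as present when its count is strictly above the threshold, the loss is averaged over cuboids, and the threshold value itself is left to the caller because it is not recoverable from the text. The names `existence_targets` and `existence_loss` are hypothetical.

```python
import math

def existence_targets(weights, threshold):
    """1.0 for cuboids that receive more than `threshold` points under argmax."""
    counts = [0] * len(weights[0])
    for row in weights:
        counts[max(range(len(row)), key=row.__getitem__)] += 1
    return [1.0 if c > threshold else 0.0 for c in counts]

def existence_loss(predicted, targets, tiny=1e-7):
    """Mean binary cross-entropy between predicted and approximated existence."""
    total = 0.0
    for p, t in zip(predicted, targets):
        p = min(max(p, tiny), 1.0 - tiny)          # keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(predicted)
```

The targets come from the segmentation branch rather than from annotations, so in a gradient-based implementation they would typically be treated as constants when back-propagating this loss.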

Figure 6. Comparison of cuboid reconstruction results. (a) Input point clouds. (b) Results of VP (Tulsiani et al., 2017). (c) Results of HA (Sun et al., 2019). (d) Our results. VP tends to parse the target shapes into oversimplified structures. While HA uses a hierarchical representation, it is not easy to choose an appropriate grain for each shape, leading to over-partition or under-partition. Both VP and HA fail to capture some small components, like the aircraft engines, brackets between the legs of chairs, etc. In comparison, our method produces more realistic structures.

3.4.4. Latent code KL-divergence loss

As in the vanilla VAE model (Kingma and Welling, 2014), we also assume that the D latent code conforms to a standard Gaussian distribution with the assumption of independence of each dimension. We use KL divergence as the loss function for distribution constraints:


This VAE module with KL divergence loss supports shape generation and manipulation applications at a high-order structure level.
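For a Gaussian with independent dimensions, the KL divergence to a standard normal has a well-known closed form, which the sketch below evaluates from a mean vector and a standard-deviation vector. Summing over latent dimensions without further scaling is an assumption about normalization, and `kl_to_standard_normal` is a hypothetical name.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )
```

The value is zero exactly when every mean is 0 and every standard deviation is 1, and it grows as the predicted distribution moves away from the standard normal.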

3.4.5. Network training

We train our network end-to-end with the total loss


We set , and

in training. The network is implemented in the PyTorch framework

(Paszke et al., 2019). All the experiments were conducted on a NVIDIA Geforce GTX1080Ti GPU. We train each category separately. The biases for the convolutional layers for predicting the cuboid scales and translations are all initialized to 0, and the one to predict rotation quaternions is initialized to

, while all the other parameters of the network are randomly initialized. We use the Adam optimizer for a total of 1000 epochs, setting the batch size 32 and the learning rate


Figure 7. Visualization of our segmentation results on four object categories. The same color represents the cuboids of the same serial number. Although only simple cuboid representation is used, our method yields reasonable segmentation results for even non-cubic structures, such as aircraft fuselages, irregular seats, and animals. Our unsupervised segmentation results show consistency among shapes, such as seat backs, tabletops, etc., which are represented using the cuboids with the same order across examples.

4. Experiments and Analysis

Our joint shape abstraction and segmentation network obtains not only accurate structural representations but also highly consistent point cloud segmentation results. In this section, we mainly evaluate our method for the shape reconstruction task and the point cloud co-segmentation task and demonstrate its superiority compared to other shape abstraction methods. Then we conduct ablation studies to verify the effectiveness of each component in our approach.

4.1. Structured shape reconstruction

Method                        Airplane   Chair   Table   Animal
VP (Tulsiani et al., 2017)    0.725      1.006   1.525   1.896
HA (Sun et al., 2019)         0.713      1.109   1.449   0.575
Ours                          0.329      0.399   0.848   0.350
Table 1. Quantitative comparison of shape reconstruction performance by Chamfer Distance. Our method outperforms the two state-of-the-art cuboid abstraction methods on the four categories.

We first evaluate the shape reconstruction performance of our method and provide quantitative and qualitative comparisons to the previous cuboid-based shape abstraction methods.

For quantitative reconstruction evaluation, we use four categories of shapes: airplane (3640), chair (5929), and table (7555) from the ShapeNet dataset (Chang et al., 2015), and four-legged animal (129) from Tulsiani et al. (2017). We divide the data into training and test sets in a 4:1 ratio, the same as Sun et al. (2019). Due to the limited training data of the animal category, we augment it by rotating each model about a fixed axis. For the four categories of shapes, we set the number of cuboids to 20, 16, 14, and 16, respectively. All shapes are pre-aligned and normalized to the unit scale.

We compare our method to two state-of-the-art cuboid-based shape abstraction methods: VP (Tulsiani et al., 2017) and HA (Sun et al., 2019). To evaluate the reconstruction performance, we adopt the commonly-used Chamfer Distance (CD) (Barrow et al., 1977) between two point sets. The predicted point set is obtained by uniformly sampling over the predicted model composed of parametric cuboids. In our experiments, we use symmetric CD and evaluate on 4096 points sampled from the predicted model and the input point cloud. Table 1 shows the quantitative comparison of our method with other cuboid-based shape abstraction methods. Our method outperforms the two state-of-the-art methods on the four object categories, demonstrating its capability for better geometric reconstruction by understanding various object structures. A group of reconstructed results using the three methods are shown in Fig. 6. Both VP and HA methods have difficulty in capturing fine object structures, such as the armrests of chairs and the connection bars between the table legs. In particular, VP tends to generate under-partitioned models. On the other hand, due to the way of multi-level abstraction and selection in HA, some thin structures are forcibly divided into multiple small parts, such as chair legs. In comparison, our method is able to extract more concise and precise results.
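For reference, a symmetric Chamfer distance between two point sets can be sketched in a few lines. Conventions differ between papers (squared or unsquared distances, sum or mean of the two directions), and the text does not say which one is used, so the variant below (unsquared distances, mean per direction, directions added) is an assumption, as is the name `chamfer_distance`.

```python
import math

def _dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    def one_way(src, dst):
        return sum(min(_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)
```

This brute-force version costs time proportional to the product of the two set sizes, which is acceptable for a few thousand points per shape; larger sets would normally use a spatial index or a batched GPU implementation.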

4.2. Shape co-segmentation

Method                          Airplane   Chair   Table
VP (Tulsiani et al., 2017)      37.6       64.7    62.1
SQ (Paschalidou et al., 2019)   48.9       65.6    77.7
HA (Sun et al., 2019)           55.6       80.4    67.4
BAE-Net (Chen et al., 2019)     61.1       65.5    87.0
BSP-Net (Chen et al., 2020)     74.5       82.1    90.3
Ours                            64.2       82.0    89.2
Table 2. Quantitative evaluation of shape co-segmentation performance (mIOU). Our method ranks second, while the best-performing BSP-Net employs more complicated shape primitives.

Though mainly designed for cuboid abstraction, our network also supports point cloud segmentation with part correspondences in one object category. In this section, we compare a number of shape decomposition networks on the task of unsupervised point cloud segmentation. In addition to VP and HA approaches, we also compare with three state-of-the-art shape decomposition networks, Super Quadrics (SQ) (Paschalidou et al., 2019), Branched Auto Encoders (BAE) (Chen et al., 2019), and BSP-Net (BSP) (Chen et al., 2020). SQ uses superquadrics as geometric primitives, which have more flexibility than the cuboid representation. BAE, on the other hand, uses the distance field as the base representation and generates a complex object jointly by predicting multiple relatively simple implicit fields. BSP is based on the binary space-partitioning algorithm that cuts the 3D space with multiple planes to depict the surface of a shape. The polyhedra composed of multiple planes can be used as primitives to represent object shapes. Note that a variety of training methods are introduced in BAE, while we choose to compare with its unsupervised baseline.

We conduct a quantitative comparison on three categories of shapes: airplane (2690), chair (3758), and table (5271) in the ShapeNet part dataset (Yi et al., 2016). Since we perform structural shape modeling, following Chen et al. (2020), we reduce the original semantic annotation in the dataset from (top, leg, support) to (top, leg) for the table category by merging the 'support' label with 'leg'. We adopt the mean per-label Intersection Over Union (mIOU) as the evaluation criterion for the segmentation task. Since the segmentation branch of our network does not refer to any semantics, we assign semantic annotations to each geometric primitive as follows. We first randomly take out 20% of the shapes in the dataset, count the number of points belonging to each ground-truth annotation for each primitive in the segmentation branch, and finally assign the ground-truth annotation with the highest number of points as the label for that primitive. Afterward, these labels are transferred to the whole test set.
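The label-transfer and scoring procedure described above can be sketched as follows. The code assumes per-point primitive indices and per-point ground-truth labels as flat lists, and it skips labels that are absent from both prediction and ground truth when averaging; both choices, and the names `assign_labels` and `mean_iou`, are illustrative assumptions.

```python
from collections import Counter

def assign_labels(primitive_ids, gt_labels):
    """Map each primitive to the ground-truth label it overlaps most."""
    votes = {}
    for prim, label in zip(primitive_ids, gt_labels):
        votes.setdefault(prim, Counter())[label] += 1
    return {prim: c.most_common(1)[0][0] for prim, c in votes.items()}

def mean_iou(pred_labels, gt_labels, labels):
    """Mean per-label intersection over union for one shape."""
    pairs = list(zip(pred_labels, gt_labels))
    ious = []
    for lab in labels:
        inter = sum(p == lab and g == lab for p, g in pairs)
        union = sum(p == lab or g == lab for p, g in pairs)
        if union:                                  # skip labels absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

In use, `assign_labels` would be run once on the points of the held-out 20% of shapes, and the resulting mapping would then translate the primitive index of every test point into a semantic label before calling `mean_iou`.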

The quantitative segmentation results on the co-segmentation task are compared in Table 2. Our method achieves the best results among the primitive-based approaches (VP, SQ, and HA) and ranks second among all the methods, demonstrating that our method generates segmentation results with higher semantic consistency. Fig. 7 shows the segmentation results of our network. In addition to the above three categories, we also show segmentation results on the animal category. Notice that our point cloud segmentation results are consistent with the shape abstraction, which supports the effectiveness of our joint learning. Our method is able to partition fine shape details, such as the engines of airplanes and the connection structure between the seat and the backrest. Moreover, although we adopt the cuboid representation, our network handles non-cubic structures well, for example, the fuselage of airplanes, the backrest and cushions of a sofa, and the animals. In addition, our segmentation results exhibit strong semantic consistency, using the same cuboid to express the same structure in different instances.

4.3. Ablation study and analysis

Our key idea is to learn the cuboid abstraction and shape segmentation jointly for mutual compensation. To this end, we design a two-branch network with several loss functions to promote compatibility between them. In this section, we disentangle our network and analyze the effect of each loss function by a group of ablation experiments.

Role of the segmentation module.

To train a shape abstraction model in an unsupervised manner, the most important step is to assign parts of the input point cloud to the corresponding primitives. A reasonable part allocation facilitates the learning process. We explicitly learn the allocation weights in Eq. 3 with the segmentation branch. To verify its effectiveness for unsupervised learning, we first remove the segmentation branch and assign each point directly to its closest cuboid, i.e. the allocation weight of a point is 1 for its closest cuboid and 0 for all other cuboids. This variant is denoted as P2C-Dis. Furthermore, we also train a model (Chamfer-Dis) without the segmentation branch but using the bidirectional Chamfer distance as the reconstruction loss, as most previous unsupervised methods do. For a fair comparison, we train the model with the segmentation branch using the reconstruction loss only, denoted as P2C-Seg.
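As a rough illustration of the hard allocation used by the P2C-Dis variant, the sketch below turns a matrix of point-to-cuboid distances into one-hot allocation weights. The function name and the toy distances are hypothetical, and the distance computation itself (Eq. 4) is not reproduced here.

```python
def hard_assignment(distances):
    """Return one-hot weights; distances[i][m] is the distance from point i to cuboid m."""
    weights = []
    for row in distances:
        # Index of the closest cuboid for this point.
        closest = min(range(len(row)), key=row.__getitem__)
        weights.append([1.0 if m == closest else 0.0 for m in range(len(row))])
    return weights


# Toy distances (hypothetical) for three points and three cuboids.
toy_distances = [
    [0.10, 0.50, 0.90],
    [0.70, 0.20, 0.30],
    [0.40, 0.40, 0.05],
]
toy_weights = hard_assignment(toy_distances)
```

Because the argmin is piecewise constant, such weights give no gradient that could move a point from one cuboid to another, which is one way to read the local-minimum behaviour reported for the distance-based variants.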

We compare the reconstruction results of the three variants in Table 3, which shows that P2C-Seg outperforms the other two variants on all three categories. In Fig. 8, we visualize the part allocation results of the three variants. For P2C-Seg, we directly visualize the segmentation results predicted by the segmentation branch. For Chamfer-Dis and P2C-Dis, as they assign a point to its closest cuboid primitive for computing the reconstruction loss, we attach the label of the nearest cuboid to each point. The part allocation results of P2C-Dis and Chamfer-Dis are more scattered than those of P2C-Seg, leading to stacked cuboids. In addition, since it is difficult to adjust a hard assignment based on geometric distance, the P2C-Dis and Chamfer-Dis models usually get stuck in local minima. For example, in the right column of Fig. 8, the four chair legs are merged into one large cuboid rather than represented by four small cuboids. In contrast, the part allocations of P2C-Seg are more explicit, so more compact abstraction results are achieved.
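For reference, a bidirectional Chamfer distance between two point sets can be sketched as follows. The exact definition used for Table 3 is given in Sec. 4.1 and is not repeated in this section, so the use of squared distances and per-set averaging below is an assumption, as are the function names and toy points.

```python
def squared_dist(a, b):
    """Squared Euclidean distance between two points given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_sided(src, dst):
    """Average squared distance from each point in src to its nearest point in dst."""
    return sum(min(squared_dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(points_a, points_b):
    """Sum of the two one-sided terms (assumed convention)."""
    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


# Toy point sets (hypothetical): set_b has one extra point not covered by set_a.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(set_a, set_b)
```

The brute-force nearest-neighbour search is quadratic in the number of points and is only meant to make the two directions of the metric explicit.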

Figure 8. Cuboid assignment and abstraction results obtained from three variants with different reconstruction supervision. (a) P2C-Dis. (b) Chamfer-Dis. (c) P2C-Seg. The segmentation module in (c) helps shape abstraction with more reasonable part allocation.
Category P2C-Dis Chamfer-Dis P2C-Seg
Airplane 4.836 0.286 0.237
Chair 3.356 0.834 0.436
Table 3.335 0.992 0.846
Table 3. Quantitative evaluation of cuboid reconstruction by Chamfer distance on three object categories using different supervision manners.
Point-to-cuboid distance.

The implementation of the point-to-cuboid distance in Eq. (4) greatly affects the network optimization process. We make two design choices, one on the projection manner and one on random sampling, to prevent model degeneration and to enhance robustness to shape noise. Next, we disentangle these two designs and analyze their respective impacts.

An intuitive way of computing this distance is to project a point onto its closest cuboid surface. However, as shown in Fig. 9, this admits a degenerate solution because surface normal consistency is ignored: the distance can be made arbitrarily small as long as there is at least one cuboid surface near each point. In contrast, projecting according to the normal direction not only ensures that the cuboid surface being optimized has a surface normal similar to that of the point, but also helps this normal-similar cuboid surface become the closest surface as the optimization proceeds.
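The difference between the two projection manners can be illustrated on a single box. The sketch below is a simplification under stated assumptions: the cuboid is axis-aligned and centred at the origin (the cuboids in the paper also carry a rotation and translation), the distance is measured to the face plane rather than to the bounded face, and all names and values are hypothetical.

```python
# The six faces of an axis-aligned box, as (axis, sign) pairs; the outward
# normal of face (axis, sign) is sign * e_axis.
FACES = [(axis, sign) for axis in range(3) for sign in (1.0, -1.0)]


def plane_distance(point, half_extents, axis, sign):
    """Distance from the point to the plane containing face (axis, sign)."""
    return abs(point[axis] - sign * half_extents[axis])


def closest_face(point, half_extents):
    """Minimum-distance projection: pick the face whose plane is nearest."""
    return min(FACES, key=lambda f: plane_distance(point, half_extents, *f))


def normal_face(normal):
    """Normal-based projection: pick the face whose outward normal best matches."""
    return max(FACES, key=lambda f: f[1] * normal[f[0]])


# Toy example: a thin slab and a point whose normal points along +y.
half_extents = (1.0, 0.05, 1.0)
point = (0.98, 0.3, 0.0)
normal = (0.0, 1.0, 0.0)

face_min = closest_face(point, half_extents)   # ignores the normal
face_nrm = normal_face(normal)                 # respects the normal
dist_min = plane_distance(point, half_extents, *face_min)
dist_nrm = plane_distance(point, half_extents, *face_nrm)
```

Here the minimum-distance rule selects the +x face and reports a small distance even though the point's normal is along +y, whereas the normal-based rule selects the +y face and exposes the larger residual that the optimization still needs to reduce.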

While the projection manner selects which cuboid surface is used to compute the point-to-cuboid distance, we do not directly use the shortest distance from a point to that surface. Instead, we randomly sample a point along the normal direction of the input point and project it onto the selected cuboid surface to find its closest point. The resulting distance vanishes only when the input point lies on a cuboid surface that has the same normal direction as the point. This forces the network to account for geometric distance as well as normal similarity without increasing the computational complexity. In contrast, the bidirectional Chamfer distance used in previous methods requires computing distances in both directions and sampling on cuboid surfaces.

Moreover, this random sampling also improves the robustness to noisy point clouds. We add Gaussian noise of varying variance to the input point clouds and compare the reconstruction quality using different point-to-cuboid distances in Table 4. The models trained with minimum-distance projection degenerate, while the models with normal-similar projection achieve satisfactory reconstruction accuracy. Random sampling drives the network to pay more attention to surface orientations, so the normal consistency is generally higher than that obtained without point sampling along the normal direction. In addition, the models trained with normal sampling are more robust to noise: as the noise level increases, their reconstruction quality remains generally better than that of the models trained without sampling.

Figure 9. Degenerate solutions of the minimum point-to-cuboid distance without considering normal consistency. The resulting cuboids tend to be thin and stacked in one direction.
Noise Distance Normal (three sampling ranges)
0.00 56.042/0.451 0.386/0.758 0.389/0.762 0.399/0.763
0.01 57.371/0.448 0.463/0.732 0.480/0.730 0.442/0.736
0.02 57.450/0.465 0.626/0.667 0.598/0.679 0.572/0.671
0.03 58.734/0.461 0.909/0.641 0.727/0.645 0.754/0.652
Table 4. Robustness of our point-to-cuboid distance. We report the Chamfer distance (smaller is better) and the normal consistency (larger is better) between the reconstructed shape and the input point cloud under various noise levels and sampling ranges.
Compactness loss.

The compactness loss is designed to penalize redundant cuboids. We adjust the weight of the compactness loss exponentially and analyze the reconstruction results. In Table 5, we report the average number of cuboids used in the shape abstraction results over all instances, the reconstruction quality (CD), and the mIOU for part segmentation. As the weight increases, the average number of cuboids gradually decreases, as analyzed in Sec. 3.4.2, and the CD increases because fewer cuboids are available to represent the shape. Fig. 10 shows the abstraction results of three shapes with different weights. When the weight is zero (b), the network is freely optimized without any constraint on the number of cuboids, leading to over-disassembled shapes with redundant cuboids. From (a) to (d), the increasing weight leads to a more concise structural representation. When the weight is too large, the network uses too few cuboids and some structures are lost, which is also reflected in the CD and mIOU in Table 5. On the other hand, the abstracted shapes are structurally consistent among multiple instances of the category under different weight settings. In each column, the common parts of different instances, such as the back, the seat, and the four legs, are consistently represented.

Figure 10. Abstraction results under different weights for the compactness loss. We indicate the number of used cuboids on the right of each abstraction. Our method produces structurally consistent results in the same category with various weights. Suitable weights lead to concise and accurately structured representations.
Weight 0.00 0.05 0.10 0.20
#Cuboids 15.486 11.144 9.753 5.544
CD 0.381 0.385 0.399 1.452
mIOU 81.2 82.1 82.0 72.5
Table 5. Abstraction performance with different weights for the compactness loss.
Choice of the cuboid number

In our experiments, we set the maximum number of cuboids empirically for different categories. To evaluate the sensitivity of our method to this number, we vary it while fixing all other hyper-parameters when training our model. In Table 6, we report the average Chamfer distance and the average number of cuboids used in the abstraction results under different maximum cuboid numbers for three categories. As expected, the CD decreases as the maximum cuboid number increases, since more cuboids can be used to handle diverse structures and to better fit fine shapes. In effect, decreasing the maximum cuboid number with a fixed compactness weight is analogous to increasing the compactness weight with a fixed maximum cuboid number when training our network, and abstraction results similar to those in Fig. 10 can be obtained.

Category Metric 8 12 16 20 24
Airplane CD 0.849 0.484 0.394 0.329 0.285
Airplane #Cuboids 4.171 5.663 9.801 11.801 13.720
Chair CD 1.381 0.501 0.399 0.385 0.366
Chair #Cuboids 5.976 7.150 9.753 10.130 12.158
Table CD 1.170 1.062 0.826 0.776 0.657
Table #Cuboids 5.636 7.328 8.524 9.747 12.573
Table 6. Impact of the maximum cuboid number on the shape abstraction task. We present the average CD and the average number of used cuboids for comparison.
Robustness on sparse point clouds.

To verify the robustness of our framework to the density of input point clouds, we train our model with different numbers of points, including 256, 1024, and 4096, on the chair category. In Fig. 11, we show the abstraction and segmentation results under different point cloud densities for the same shape. It shows that our method produces reasonable and consistent structural representation for the same shape with various point densities.

Figure 11. Segmentation and abstraction results trained with different point numbers. Reasonable structural representation with semantic consistency can be obtained even using fewer points.

4.4. Applications

Based on our network architecture, our method supports multiple applications, including generating and interpolating cuboid representations as well as structural clustering.

Figure 12. Shape generation results by randomly sampling a standard normal noise vector. The automatically generated shapes have reasonable and realistic cuboid structures with a large diversity.
Shape generation

Benefiting from the VAE architecture, our network can accomplish the shape generation task by sampling the latent code from a standard normal distribution, a capability that previous unsupervised shape abstraction methods, such as VP and HA, do not provide. Fig. 12 shows a group of generated shapes, demonstrating the capability of our method in generating structurally diverse and plausible 3D shapes with cuboid representation.

Figure 13. Shape interpolation by linearly interpolating between the latent codes of two shapes. Continuous change in geometry and structure can be observed in the interpolated shapes from left to right.
Shape interpolation

Shape interpolation can be used to generate a smooth transition between two existing shapes. The quality of shape interpolation depends on whether the network learns a smooth high-dimensional manifold in the latent space. We evaluate our network by interpolating between pairs of 3D shapes to demonstrate that our method learns such a manifold. As illustrated in Fig. 13, our shape abstraction network produces a smooth transition between two shapes with reasonable and realistic 3D structures. For example, the backrest gradually becomes smaller and the seat gets progressively thinner from left to right.
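The linear interpolation of latent codes mentioned in the caption of Fig. 13 can be sketched as follows. Only the interpolation step is shown; in practice each intermediate code would be passed to the trained decoder, which is not reproduced here. The function name, the two-dimensional toy codes, and the number of steps are illustrative assumptions.

```python
def lerp_codes(z_start, z_end, steps):
    """Return `steps` latent codes evenly spaced from z_start to z_end, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    codes = []
    for k in range(steps):
        t = k / (steps - 1)  # interpolation parameter in [0, 1]
        codes.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes


# Toy latent codes (hypothetical); real codes would come from the encoder.
z_a = [0.0, 2.0]
z_b = [1.0, 0.0]
path = lerp_codes(z_a, z_b, steps=5)
```

Decoding the codes in `path` from first to last would give a sequence of abstractions of the kind shown from left to right in Fig. 13.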

Structural shape clustering
Figure 14. Four groups of shapes of the table category using our structural clustering. We use the learned cuboid existence indicators as measurement and group shapes that have the same existence indicator together. From top to bottom, the serial numbers of appearing cuboids are [5,7,10,12,14], [5,7,9,10,12,14], [3,9,10], and [7,10,12] respectively.

Though the object geometries can vary dramatically within the same category, they often follow some specific structural design. Based on the learned structured cuboid representation, our method supports 3D shape clustering according to the common shape structure. We use the existence indicator vector of the cuboids learned by our abstraction network to represent the object structure. The shapes in which the same cuboids appear are considered to have the same structure. Fig. 14 shows four groups of structural clustering results of the table category. We can see that in each cluster, even though the geometric details vary greatly, the 3D shapes share a common structure.
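Grouping shapes by their cuboid existence indicators amounts to using the set of active cuboid indices as a dictionary key. The sketch below assumes that the indicators are scores in [0, 1] and that a threshold of 0.5 decides whether a cuboid exists; both are assumptions, as are the function name and the toy shapes.

```python
from collections import defaultdict


def cluster_by_structure(existence, threshold=0.5):
    """Group shape ids whose binarised existence indicators are identical."""
    clusters = defaultdict(list)
    for shape_id, indicator in existence.items():
        # The tuple of active cuboid indices serves as the structure signature.
        key = tuple(m for m, e in enumerate(indicator) if e > threshold)
        clusters[key].append(shape_id)
    return dict(clusters)


# Toy indicators (hypothetical) for three shapes and five cuboids.
toy_existence = {
    "table_a": [0.90, 0.10, 0.80, 0.00, 0.70],
    "table_b": [0.95, 0.20, 0.60, 0.10, 0.90],
    "table_c": [0.90, 0.80, 0.10, 0.00, 0.00],
}
groups = cluster_by_structure(toy_existence)
```

The keys of `groups` play the role of the serial numbers of appearing cuboids listed in the caption of Fig. 14.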

4.5. Failure cases, limitations, and future work

The robustness and effectiveness of our unsupervised shape abstraction method have been demonstrated by extensive experiments. However, it also has some limitations and fails in some special cases. Since our method takes the structural consistency between different shape instances in the same category into account, instances with highly unusual structures may not be well reconstructed, such as the aircraft in Fig. 15 (a). Due to the uniform sampling of point clouds in 3D space, points sampled on fine structures are too scarce to provide sufficient geometric information, e.g. the table legs in Fig. 15 (b). Another failure case is caused by a small maximum number of cuboids. For shapes with many small parts, our method fails to precisely recover all the parts with a limited number of cuboids. For the example in Fig. 15 (c), the six thin slats are grouped into two parts in order to keep semantic consistency with other chairs.

Figure 15. Three representative failure cases of our method.

Our framework can be improved in several directions in the future. Currently, our shape abstraction network is trained separately for each specific object category. A future direction is to encode multiple classes of shapes in one network simultaneously or to learn transferable features across categories. Another direction is integrating multiple geometric primitives to represent objects, or the primitives with stronger representational capability, such as polyhedra (Deng et al., 2020) and star-domain (Kawana et al., 2020). However, there should be a trade-off between representational capability and structural simplicity. Moreover, although the manually annotated dataset (Mo et al., 2019b) already contains rich relationships between parts, unsupervised relationship discovery among parts is still a challenging task, which can be further investigated.

5. Conclusions

In this paper, we introduce a method for unsupervised learning of cuboid-based representation for shape abstraction from point clouds. We take full advantage of shape co-segmentation to extract the common structure in an object category to mitigate the multiplicity and ambiguity of sparse point clouds. We demonstrate the superiority of our method in preserving structural and semantic consistency in cuboid shape abstraction and point cloud segmentation. Our generative network is also versatile, supporting shape generation, point cloud segmentation, shape interpolation, and structural clustering.

We thank the anonymous reviewers for their constructive comments. We are grateful to Dr. Xin Tong for his insightful suggestions on this project. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61632006 and Grant 62076230; in part by the Fundamental Research Funds for the Central Universities under Grant WK3490000003; and in part by Microsoft Research Asia.


  • H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf (1977) Parametric correspondence and chamfer matching: two new techniques for image matching. In Proceedings of the International Joint Conferences on Artificial Intelligence Organization (IJCAI), Cited by: §4.1.
  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR]. Cited by: §4.1.
  • Z. Chen, A. Tagliasacchi, and H. Zhang (2020) BSP-net: generating compact meshes via binary space partitioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, §4.2, Table 2.
  • Z. Chen, K. Yin, M. Fisher, S. Chaudhuri, and H. Zhang (2019) BAE-net: branched autoencoder for shape co-segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §4.2, Table 2.
  • B. Deng, K. Genova, S. Yazdani, S. Bouaziz, G. Hinton, and A. Tagliasacchi (2020) CvxNet: learnable convex decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.5.
  • A. Dubrovina, F. Xia, P. Achlioptas, M. Shalah, R. Groscot, and L. J. Guibas (2019) Composite shape modeling via latent space factorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • M. Gadelha, G. Gori, D. Ceylan, R. Mech, N. Carr, T. Boubekeur, R. Wang, and S. Maji (2020) Learning generative models of shape handles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • L. Gao, J. Yang, T. Wu, Y. Yuan, H. Fu, Y. Lai, and H. Zhang (2019) SDM-NET: deep generative network for structured deformable mesh. ACM Trans. Graph. (SIGGRAPH ASIA) 38 (6), pp. 243:1–243:15. Cited by: §2.
  • Y. Kawana, Y. Mukuta, and T. Harada (2020) Neural star domain as primitive representation. In Advances in Neural Information Processing Systems, Cited by: §4.5.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), Cited by: §3.4.4, §3.
  • J. Li, C. Niu, and K. Xu (2020) Learning part generation and assembly for structure-aware shape synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas (2017) GRASS: generative recursive autoencoders for shape structures. ACM Trans. Graph. (SIGGRAPH) 36 (4), pp. 52:1–52:14. Cited by: §1, §2.
  • C. Lin, L. Liu, C. Li, L. Kobbelt, B. Wang, S. Xin, and W. Wang (2020) SEG-mat: 3d shape segmentation using medial axis transform. IEEE Transactions on Visualization and Computer Graphics. Cited by: §2.
  • F. Liu and X. Liu (2020) Learning implicit functions for topology-varying dense 3d shape correspondence. In Advances in Neural Information Processing Systems, Cited by: §2.
  • K. Mo, P. Guerrero, L. Yi, H. Su, P. Wonka, N. Mitra, and L. Guibas (2019a) StructureNet: hierarchical graph networks for 3d shape generation. ACM Trans. Graph. (SIGGRAPH ASIA) 38 (6), pp. 242:1–242:19. Cited by: §1, §2.
  • K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su (2019b) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §4.5.
  • C. Niu, J. Li, and K. Xu (2018) Im2Struct: recovering 3d shape structure from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • D. Paschalidou, A. O. Ulusoy, and A. Geiger (2019) Superquadrics revisited: learning 3d shape parsing beyond cuboids. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4.2, Table 2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Cited by: §3.4.5.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • N. Schor, O. Katzir, H. Zhang, and D. Cohen-Or (2019) CompoNet: learning to generate the unseen by part synthesis and composition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • D. Smirnov, M. Fisher, V. G. Kim, R. Zhang, and J. Solomon (2020) Deep parametric shape predictions using distance fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • C. Sun, Q. Zou, X. Tong, and Y. Liu (2019) Learning adaptive hierarchical cuboid abstractions of 3d shape collections. ACM Trans. Graph. (SIGGRAPH ASIA) 38 (6), pp. 241:1–241:13. Cited by: §1, §2, Figure 6, §3, §4.1, §4.1, Table 1, Table 2.
  • S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik (2017) Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Figure 6, §3, §4.1, §4.1, Table 1, Table 2.
  • H. Wang, N. Schor, R. Hu, H. Huang, D. Cohen-Or, and H. Huang (2018) Global-to-local generative model for 3d shapes. ACM Trans. Graph. (SIGGRAPH ASIA) 37 (6), pp. 214:1–214:10. Cited by: §2.
  • Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 38 (5), pp. 146:1–146:12. Cited by: §3.1.
  • R. Wu, Y. Zhuang, K. Xu, H. Zhang, and B. Chen (2020) PQ-net: a generative part seq2seq network for 3d shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • Z. Wu, X. Wang, D. Lin, D. Lischinski, D. Cohen-Or, and H. Huang (2019) SAGNet: structure-aware generative network for 3d-shape modeling. ACM Trans. Graph. (SIGGRAPH) 38 (4), pp. 91:1–91:14. Cited by: §2.
  • J. Yang, K. Mo, Y. Lai, L. J. Guibas, and L. Gao (2020) DSM-net: disentangled structured mesh net for controllable generation of fine geometry. External Links: 2008.05440 Cited by: §1, §2.
  • L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas (2016) A scalable active framework for region annotation in 3d shape collections. ACM Trans. Graph. (SIGGRAPH ASIA) 35 (6), pp. 210:1–210:12. Cited by: §1, §2, §4.2.
  • F. Yu, K. Liu, Y. Zhang, C. Zhu, and K. Xu (2019) PartNet: a recursive part decomposition network for fine-grained and hierarchical shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem (2017) 3d-prnn: generating shape primitives with recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.