1. Introduction
3D objects, especially manufactured objects, often exhibit regularity in their external geometry. For example, chairs typically contain regular parts such as the seat and back to fulfill their function of stable placement. These design rules are reflected in the object structure, including specific part shapes and the relationships among parts. Understanding objects from the structural perspective makes it easier to answer “why they are designed this way” or “why they look the way they do”. While early vision tasks tend to understand an object from a holistic geometric perspective, some recent works attempt to incorporate more structural information into 3D shape understanding or modeling tasks (Li et al., 2017; Mo et al., 2019a; Yang et al., 2020). Though these structure-based approaches have shown significant advantages, object structure, which is often hidden behind the geometry, is not easy to acquire. Many researchers have attached structural information to object shapes via manual annotations (Yi et al., 2016; Mo et al., 2019b), exploiting the fact that humans can easily recognize the structure of objects. However, the automatic acquisition of structural representations for common objects remains a challenging task.
In this paper, we focus on unsupervised shape abstraction, which is dedicated to parsing objects into concise structural representations. Since the geometry of each part is much simpler than that of the whole, shape abstraction attempts to assemble an object from simple geometric primitives. Previous works often employ data such as volumes (Tulsiani et al., 2017), watertight meshes (Sun et al., 2019), or signed distance fields (Smirnov et al., 2020) as the supervision, since spatial occupancy information provides important guidance for the shape abstraction task. However, such data are not easy to acquire and require extra geometric processing.
On the other hand, point clouds are closer to raw scan data, which is much easier to acquire with LiDAR or RGB-D cameras. However, little work has been done on learning primitive representations from point clouds alone, since discrete point clouds lack dense occupancy supervision. The sparsity of point clouds easily causes degeneration in the structural representation. Fig. 2 shows a 2D example. Considering a point cloud sampled from the surface of a rectangle, both cuboid representations match its surface geometry, as we can divide the points into arbitrary groups and fit them with different cuboids separately. Thus ambiguity is introduced. In this case, the shape abstraction task degenerates into a surface-fitting task with planes, and each primitive loses its structural properties.
Compared to shape abstraction, which focuses more on geometry fitting, the co-segmentation task of point clouds is biased towards extracting the common structure of the whole dataset in order to prevent ambiguity and degeneration. Our main idea is to learn a general rule for assigning points to different cuboid primitives consistently among shapes, rather than relying directly on the geometric proximity between points and cuboids. Moreover, for the joint tasks of segmentation and structural abstraction without structural annotations, we design several easy-to-implement loss functions that do not need any sampling operation on parametric surfaces.
Under the unsupervised learning framework, we propose a shape abstraction network based on a variational autoencoder (VAE) that embeds the input point cloud into a latent code and decodes it into a set of parameterized cuboids. In addition, we make full use of the high-level feature embedding to exploit the point–cuboid correlation for the segmentation task. Benefiting from the joint segmentation branch, although some supervisory information is lacking, our algorithm is able to extract finer object structures than previous methods and obtain consistent part segmentation among instances with diverse shapes in the same category. Based on the latent space embedding, our method not only supports abstract shape reconstruction from point clouds but also supports shape interpolation and virtual shape synthesis by sampling the latent shape code.
In summary, this paper makes the following key contributions:

We propose an unsupervised framework that entangles part segmentation and cuboid-based shape abstraction for point clouds.

A set of novel loss functions is designed to jointly supervise the two tasks without requiring manual annotations of parts or structural relationships.

In addition to shape abstraction and segmentation, our framework supports a range of applications, including shape generation, shape interpolation, and structural shape clustering.
2. Related Work
Manufactured objects always exhibit strong structural properties. Understanding the high-order structural representation of shapes has been a hot research topic in the field of geometric analysis. The main difficulty of this task is that structural properties are embedded within the geometric shape. Moreover, structure and geometry are intertwined and affect each other, making it challenging to decouple part structures from geometric shapes. In this section, we discuss the most relevant works on both supervised and unsupervised learning for object structures and shape co-segmentation.
Supervised learning for object structure
Some researchers have turned their attention to supervised learning of object structures from large-scale datasets with manual structural annotations (Yi et al., 2016; Mo et al., 2019b) or parts decomposed by traditional optimization algorithms (Zou et al., 2017). These approaches can be divided into three categories according to how object parts are organized. Approaches in a sequential manner (Zou et al., 2017; Wu et al., 2020) employ recurrent neural networks to encode and decode parts sequentially. In the parallel manner, parts are organized in a flat way (Wang et al., 2018; Gao et al., 2019; Wu et al., 2019; Schor et al., 2019; Dubrovina et al., 2019; Li et al., 2020; Gadelha et al., 2020). The tree-based manner has attracted a lot of attention recently: different levels in the tree structure represent different granularities of parts, and parent–child node pairs represent the inclusion relationship between parts. Most tree-based methods require ground truth of the relationships between parts, such as adjacency and symmetry, to construct the tree. A generative recursive neural network (RvNN) is designed to generate shape structures and trained in multiple phases for encoding shapes, learning shape manifolds, and synthesizing geometric details, respectively (Li et al., 2017). Niu et al. (2018) employ an RvNN to recursively decode a global image feature in a binary tree organization. A similar binary tree structure is also used for point cloud segmentation to exploit structural constraints across different levels (Yu et al., 2019). In order to encode more elaborate structural relationships between parts, StructureNet is introduced to integrate part-level connectivity and inter-part relationships hierarchically in an n-ary graph (Mo et al., 2019a). Yang et al. (2020) design a two-branch recursive neural network to explicitly represent geometry and structure in 3D shapes for controllable shape synthesis.

Unsupervised structural modeling
On the other hand, many approaches use unsupervised learning for the structure parsing task, assuming that the object geometry naturally reveals structural properties. One direction is unsupervised shape abstraction, which assembles 3D shapes from geometric primitives while preserving consistent structures across a shape collection. Tulsiani et al. (2017) make the first attempt to apply neural networks to abstracting 3D objects with cuboids without part annotations. A coverage loss and a consistency loss are developed to encourage mutual inclusion of target objects and predicted shapes. Sun et al. (2019) propose an adaptive hierarchical cuboid representation for more compact and expressive shape abstraction. They construct multiple levels of cuboid generation at different granularities and use a cuboid selection module to obtain the optimal abstraction. Besides cuboids, representing 3D shapes with more types of parametric primitives has been studied recently. Smirnov et al. (2020) define a general Chamfer distance in an Eulerian version based on a distance field, which allows abstraction with multiple parametric primitives. The superquadric representation is another option for enhancing geometric expressiveness and is demonstrated to be easier to learn than the cuboid representation for curved shapes (Paschalidou et al., 2019). They use only point clouds as supervision by constructing a bidirectional Chamfer distance between the predicted primitives and the point cloud. However, it requires differentiable uniform sampling over the primitives, which is not easy. In this paper, we design a new formulation of a single-direction point-to-cuboid reconstruction loss that avoids sampling points on the cuboid surfaces.
Unsupervised segmentation
Some unsupervised object segmentation works can also be used to analyze object structures. Chen et al. (2019) treat co-segmentation as a shape representation learning problem. They learn multiple implicit field branches to represent the individual parts of an object while preserving the consistency of the segmented parts over the entire dataset. Aiming to learn dense correspondences between 3D shapes in an unsupervised manner, a novel implicit function is proposed to measure the correspondence degree between points in different shapes and to obtain co-segmentation results (Liu and Liu, 2020). Lin et al. (2020) develop an efficient approach based on the medial axis transform to identify junctions between parts of a 3D shape for segmentation. However, this method does not consider relationships among shapes, such as their common structure. In comparison, we jointly learn shape abstraction and segmentation in an unsupervised manner while preserving structural consistency over different instances in the segmentation task.
3. Our Approach
Given a point cloud that contains 3D points of an object, our goal is to reconstruct a set of cuboids to concisely represent the 3D shape of the object. Similar to previous methods (Tulsiani et al., 2017; Sun et al., 2019), each cuboid is parameterized by three vectors: a translation t ∈ R^3, a quaternion q ∈ R^4 representing the 3D rotation, and a scale s ∈ R^3. For different objects with various shapes in the same category, we attempt to predict a fixed number K of cuboids. The fixed order of cuboids naturally conveys the part correspondence across different instances of the same category. However, even within the same category, the structure of each instance varies a lot. For example, some chairs have armrests while others do not. To fit different structures, we add an indicator for each cuboid that denotes whether this cuboid appears in an instance. In summary, we parameterize each cuboid as an 11-D vector (3 + 4 + 3 + 1) that includes the geometric properties and the existence of the cuboid.
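As a toy illustration of this parameterization, the per-cuboid vector can be packed and unpacked as follows (a minimal NumPy sketch; the helper names are hypothetical and not part of our implementation):

```python
import numpy as np

def pack_cuboid(translation, quaternion, scale, existence):
    """Pack translation (3) + quaternion (4) + scale (3) + existence (1)
    into an 11-D per-cuboid parameter vector."""
    return np.concatenate([translation, quaternion, scale, [existence]])

def unpack_cuboid(vec):
    """Split an 11-D cuboid vector back into its components."""
    return vec[:3], vec[3:7], vec[7:10], float(vec[10])

t = np.array([0.1, 0.0, -0.2])          # translation
q = np.array([1.0, 0.0, 0.0, 0.0])      # identity rotation quaternion
s = np.array([0.5, 0.2, 0.3])           # per-axis scale
v = pack_cuboid(t, q, s, 1.0)           # existence indicator = 1
assert v.shape == (11,)
```
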
We employ a variational autoencoder (VAE) (Kingma and Welling, 2014) framework that embeds the input point cloud into a latent code and decodes the parameters of the cuboids with a decoder, as shown by the feature embedding network and shape abstraction network in Fig. 3. Meanwhile, we design a segmentation network that integrates an attention scheme to build the point–cuboid correlations for allocating points to the cuboids. To train the encoder-decoder network without manual annotations of part-based structures, we jointly learn the shape abstraction branch and the segmentation branch by enforcing the consistency between the part segmentation and the reconstructed cuboids through a set of specially designed losses.
3.1. Feature embedding network
The feature embedding network maps the input point cloud into a latent code z that follows a Gaussian distribution. We first extract point-wise features with a backbone that contains two EdgeConv layers (Wang et al., 2019), where each layer yields a feature vector for each point at its level. By concatenating these two features together, the point feature is obtained. Then we extract a global feature by feeding the point features into a fully-connected layer with max-pooling (Qi et al., 2017). As a generative model, we map the global feature to a latent space. We design two branches composed of fully connected layers to predict the mean μ and standard deviation σ of the Gaussian distribution of the latent variable, respectively. Then the latent code z is obtained by reparameterizing a random noise ε that follows a standard normal distribution:
z = μ + σ ⊙ ε,    (1)
where ⊙ represents element-wise multiplication.
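The reparameterization in Eq. (1) can be sketched as follows (a NumPy stand-in for the actual PyTorch implementation; predicting the log-variance instead of σ directly is an assumption, though it is the common convention):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma ⊙ eps with eps ~ N(0, I); sampling written this way
    stays differentiable w.r.t. mu and sigma in an autodiff framework."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(8), np.zeros(8), rng)   # mu = 0, sigma = 1
assert z.shape == (8,)
```
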
3.2. Shape abstraction network
The shape abstraction sub-network decodes the latent code z into the parameters of the K cuboids. In order to retain the high-level structure information with part correspondence, we first infer cuboid-related features with separate sub-branches for each cuboid from the latent code z. In the j-th branch, we embed a one-hot vector, whose j-th entry is 1 and all other entries are 0, with a cuboid code encoder to obtain an embedding vector for the j-th cuboid. We concatenate the cuboid embedding vector with the latent code z and feed the result into a cuboid feature encoder to obtain a cuboid feature. With the fixed order of cuboids and the one-hot cuboid codes, the part correspondences are preserved implicitly in the network, so that each cuboid feature not only contains the geometric shape information of the corresponding cuboid but also embeds the structure of a specific part of the object.
Then, each cuboid feature passes through a cuboid parameter prediction module to estimate the geometric parameters and the existence probability of the corresponding cuboid. Note that the feature encoders and the cuboid decoder in each cuboid branch are all composed of fully connected layers and share parameters among all cuboids.

3.3. Cuboid-associated segmentation network
The segmentation branch allocates each point in the input point cloud to the cuboids. That is, we perform K-label point cloud segmentation, where each label corresponds to a cuboid, which is a potential part under the common structure of an object category. We use two fully-connected layers to reduce the point features and the cuboid features to feature vectors f_i and g_j of a common dimension, respectively. Treating these point features and cuboid features as the query and key in an attention scheme, we compute the affinity matrix A between points and cuboids as

A_ij = f_i · g_j.    (2)
Then a softmax operation is performed on each row of A to obtain the probability that a point p_i belongs to the part that cuboid c_j represents. Thus, we get a probability matrix W with

W_ij = exp(A_ij) / Σ_{k=1}^{K} exp(A_ik).    (3)
From this probabilistic point-to-cuboid allocation matrix W, we can simply obtain the segmentation result of a point cloud using argmax over each row.
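The attention-style allocation can be sketched as follows (a NumPy illustration with hypothetical feature dimensions; the real network operates on learned features):

```python
import numpy as np

def allocate_points(point_feats, cuboid_feats):
    """Affinity between every point and cuboid feature (dot product),
    followed by a row-wise softmax so each row of W sums to 1."""
    A = point_feats @ cuboid_feats.T                 # (N, K) affinities
    A = A - A.max(axis=1, keepdims=True)             # numerical stability
    W = np.exp(A)
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
W = allocate_points(rng.standard_normal((100, 64)),  # 100 points
                    rng.standard_normal((16, 64)))   # 16 cuboids
labels = W.argmax(axis=1)                            # hard segmentation
assert np.allclose(W.sum(axis=1), 1.0)
```
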
3.4. Loss functions
In order to train our network without ground-truth segmentation or cuboid shape representations, we design several novel losses between the results of the segmentation and abstraction branches to enforce geometric coherence and structural compactness. More specifically, we design a reconstruction loss between the segmentation and the cuboid abstraction. A compactness loss encourages the network to learn a more compact structural representation based on the point–cuboid allocation in the segmentation branch. A cuboid existence loss is designed to predict the existence indicator of each cuboid. To enable shape generation, a KL-divergence loss enforces the latent code to follow a standard Gaussian distribution.
3.4.1. Reconstruction loss
While no ground-truth cuboid annotations are available in our unsupervised framework, the segmentation branch provides a probabilistic part assignment for the input points. The reconstruction loss is expected not only to minimize the local geometric distance but also to encourage consistent high-order part allocations. We calculate the distance between each point of the input point cloud and each predicted cuboid, and aggregate these distances weighted by the probabilistic assignment predicted by the segmentation network as the shape reconstruction loss:
L_rec = Σ_{i=1}^{N} Σ_{j=1}^{K} W_ij D(p_i, c_j),    (4)

where N is the number of input points, K is the number of cuboids, and D(p_i, c_j) is the point-to-cuboid distance defined below.
This loss function tends to reduce the geometric deviation of a point p_i to the cuboids with high weights W_ij. In other words, this loss measures the compatibility of the segmentation branch with the abstraction branch. The shape parameters of a particular cuboid are optimized according to the weighted assignment W_ij, while the point–cuboid allocation probability is adjusted according to the geometric shape of the cuboids. Through this loss, we jointly optimize the cuboid parameters and the co-segmentation map.
Note that Eq. (4) is actually a weighted single-direction distance from the point cloud to the cuboids. Compared with the bidirectional Chamfer distance, it does not require a differentiable sampling operation on the cuboid surfaces. Moreover, compared with training with the Chamfer distance, where points are assigned to cuboids based on the distance between them, this formulation allows our model to learn the cuboid assignment explicitly, making it less likely to get stuck in a local minimum of the geometric optimization.
However, single-direction distances in general lead to model degeneration. We introduce normal information into the reconstruction loss to prevent degradation. Instead of calculating the distance from p_i to its closest cuboid plane, we calculate the distance from p_i to the cuboid surface whose normal direction is most similar to the point normal n_i. In addition, in order to emphasize the normal similarity and enhance the robustness to noise in point clouds, we introduce an additional sampling strategy when computing the distance. We sample a new point p′_i along the normal direction of p_i with a random offset drawn from a Gaussian distribution N(0, σ²). For the point p′_i, we look for its nearest point q_i on the selected cuboid surface and define the distance from point p_i to a cuboid as
D(p_i, c_j) = ‖p′_i − q_i‖².    (5)
Fig. 4 illustrates the distance definition considering both the normal similarity and the point sampling along the normal direction. Under this definition, D(p_i, c_j) = 0 only when p_i lies on the surface of the cuboid with the same normal. Unless specifically mentioned, σ is set to 0.05 in our experiments.
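A rough sketch of this distance for a single cuboid is given below (NumPy; a deliberate simplification that assumes an axis-aligned cuboid centered at the origin and ignores its rotation, with hypothetical helper names):

```python
import numpy as np

# Outward unit normals of the six faces of an axis-aligned cuboid.
FACE_NORMALS = np.array([[ 1, 0, 0], [-1, 0, 0],
                         [ 0, 1, 0], [ 0,-1, 0],
                         [ 0, 0, 1], [ 0, 0,-1]], float)

def point_to_cuboid_dist(p, n, half_extents, sigma=0.05, rng=None):
    """Pick the face whose normal best matches the point normal n,
    jitter p along n by delta ~ N(0, sigma^2), then measure the squared
    distance to the closest point on that face."""
    face = FACE_NORMALS[np.argmax(FACE_NORMALS @ n)]
    delta = 0.0 if rng is None else rng.normal(0.0, sigma)
    p_jit = p + delta * n
    q = np.clip(p_jit, -half_extents, half_extents)     # project into the box
    axis = np.argmax(np.abs(face))
    q[axis] = np.sign(face[axis]) * half_extents[axis]  # snap to chosen face
    return float(np.sum((p_jit - q) ** 2))

h = np.array([0.5, 0.5, 0.5])
# A point on the +x face with a matching normal: distance 0 (sampling off).
d = point_to_cuboid_dist(np.array([0.5, 0.1, 0.0]),
                         np.array([1.0, 0.0, 0.0]), h)
assert d == 0.0
```
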
3.4.2. Cuboid compactness loss
Typically, there exist multiple combinations of cuboids to represent an object. More cuboids tend to result in more accurate shape approximation. However, the structure of an object is expected to be concise and clear. Thus, a small number of cuboids is preferred for shape abstraction. We design a cuboid compactness loss to penalize a large number of cuboids.
In the semantic segmentation task, when a category label does not appear in the segmentation result, one can consider that an object of that category is absent. Therefore, we impose a sparsity loss on the segmentation result to reduce the number of cuboids used. From our point–cuboid allocation probability matrix W, we can compute the portion of points likely to be allocated to each cuboid c_j as n_j = (1/N) Σ_i W_ij. Analogously, if no points are likely to be assigned to a certain cuboid c_j, i.e. n_j = 0, we regard the structure represented by that cuboid as absent from the 3D shape. Therefore, we compute the compactness of the shape abstraction directly from n_j for each cuboid. Though an ℓ0 loss is typically used to achieve sparseness, in our case, with W defined in Eq. (3), the optimization process of the ℓ0 loss does not update the network, as illustrated in Fig. 5. Instead, we adopt a square-root norm to compute the compactness loss as:
L_compact = Σ_{j=1}^{K} √(n_j + ε),  where n_j = (1/N) Σ_i W_ij.    (6)
The small constant ε is added to prevent gradient explosion when a cuboid has no points allocated during training.
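The compactness loss can be sketched as follows (a NumPy illustration; the square-root sparsity penalty and the ε stabilizer follow the description above, with hypothetical names):

```python
import numpy as np

def compactness_loss(W, eps=1e-6):
    """Square-root sparsity on per-cuboid point portions n_j;
    eps avoids an exploding gradient when a cuboid receives no points."""
    n = W.mean(axis=0)                 # portion of points per cuboid
    return float(np.sum(np.sqrt(n + eps)))

# 4 points, each assigned to its own cuboid: every portion n_j = 0.25.
W = np.eye(4)
assert abs(compactness_loss(W) - 4 * np.sqrt(0.25 + 1e-6)) < 1e-9
```
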
3.4.3. Cuboid existence loss
As mentioned in the parametric representation of the cuboids, we predict the existence of each cuboid so that a varied number of cuboids can represent the 3D shapes of different instances. On the other hand, the point–cuboid segmentation branch can naturally handle a varying number of cuboids. Therefore, we take the portion of points allocated to each cuboid after segmentation as its existence ground truth. We set a threshold τ: when the portion of points allocated to a cuboid by the allocation matrix W exceeds τ, we consider the cuboid to exist (e*_j = 1); otherwise e*_j = 0. The value of τ is fixed in all our experiments.
We use the binary cross-entropy between the predicted existence indicator e_j for each cuboid in the shape abstraction sub-network and the approximated ground truth e*_j as the cuboid existence loss:

L_exist = −(1/K) Σ_{j=1}^{K} [ e*_j log e_j + (1 − e*_j) log(1 − e_j) ].    (7)
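The pseudo ground-truth construction and the binary cross-entropy can be sketched as follows (NumPy; the threshold value is a placeholder, not the one used in our experiments):

```python
import numpy as np

def existence_targets(W, tau=0.02):
    """Pseudo ground-truth existence: cuboid j 'exists' when the portion
    of points allocated to it exceeds tau (tau is a placeholder value)."""
    return (W.mean(axis=0) > tau).astype(float)

def existence_bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted and pseudo-GT existence."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

W = np.array([[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]])
target = existence_targets(W)          # both cuboids receive enough points
assert existence_bce(np.array([1.0, 1.0]), target) < 1e-5
```
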
3.4.4. Latent code KLdivergence loss
As in the vanilla VAE model (Kingma and Welling, 2014), we assume that the latent code z conforms to a standard Gaussian distribution, with each dimension assumed to be independent. We use the KL divergence as the loss function for this distribution constraint:
L_KL = −(1/2) Σ_d (1 + log σ_d² − μ_d² − σ_d²).    (8)
This VAE module with the KL-divergence loss supports shape generation and manipulation applications at a high-order structure level.
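The closed form in Eq. (8) can be sketched as follows (NumPy; this is the standard VAE KL term between a diagonal Gaussian and the standard normal):

```python
import numpy as np

def kl_loss(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian,
    summed over latent dimensions."""
    return float(-0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var)))

# The loss vanishes exactly when the posterior matches the prior.
assert kl_loss(np.zeros(8), np.zeros(8)) == 0.0
```
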
3.4.5. Network training
We train our network end-to-end with the total loss
L = L_rec + λ_1 L_compact + λ_2 L_exist + λ_3 L_KL.    (9)
The loss weights λ_1, λ_2, and λ_3 are set empirically in training. The network is implemented in the PyTorch framework (Paszke et al., 2019). All experiments were conducted on an NVIDIA GeForce GTX 1080Ti GPU. We train each category separately. The biases of the convolutional layers for predicting the cuboid scales and translations are all initialized to 0, and the one for predicting rotation quaternions is initialized to the identity rotation, while all the other parameters of the network are randomly initialized. We use the Adam optimizer for a total of 1000 epochs with a batch size of 32.

4. Experiments and Analysis
Our joint shape abstraction and segmentation network obtains not only accurate structural representations but also highly consistent point cloud segmentation results. In this section, we mainly evaluate our method for the shape reconstruction task and the point cloud cosegmentation task and demonstrate its superiority compared to other shape abstraction methods. Then we conduct ablation studies to verify the effectiveness of each component in our approach.
4.1. Structured shape reconstruction
Table 1. Reconstruction quality (Chamfer distance) of cuboid-based abstraction methods on four categories.

Method | Airplane | Chair | Table | Animal
VP (Tulsiani et al., 2017) | 0.725 | 1.006 | 1.525 | 1.896
HA (Sun et al., 2019) | 0.713 | 1.109 | 1.449 | 0.575
Ours | 0.329 | 0.399 | 0.848 | 0.350
We first evaluate the shape reconstruction performance of our method and provide quantitative and qualitative comparisons to the previous cuboidbased shape abstraction methods.
For quantitative reconstruction evaluation, we use four categories of shapes: airplane (3640), chair (5929), and table (7555) from the ShapeNet dataset (Chang et al., 2015), and four-legged animal (129) from (Tulsiani et al., 2017). We divide the data into training and test sets in a 4:1 ratio, the same as (Sun et al., 2019). Due to the limited training data of the animal category, data augmentation by rotating each model around the up axis is performed. For the four categories, we set the number of cuboids K to 20, 16, 14, and 16, respectively. All shapes are pre-aligned and normalized to the unit scale.
We compare our method to two state-of-the-art cuboid-based shape abstraction methods: VP (Tulsiani et al., 2017) and HA (Sun et al., 2019). To evaluate the reconstruction performance, we adopt the commonly-used Chamfer Distance (CD) (Barrow et al., 1977) between two point sets. The predicted point set is obtained by uniformly sampling over the predicted model composed of parametric cuboids. In our experiments, we use the symmetric CD and evaluate on 4096 points sampled from the predicted model and the input point cloud. Table 1 shows the quantitative comparison of our method with the other cuboid-based shape abstraction methods. Our method outperforms the two state-of-the-art methods on all four object categories, demonstrating its capability for better geometric reconstruction by understanding various object structures. A group of reconstruction results of the three methods is shown in Fig. 6. Both VP and HA have difficulty capturing fine object structures, such as the armrests of chairs and the connection bars between table legs. In particular, VP tends to generate under-partitioned models. On the other hand, due to its multi-level abstraction and selection, HA forcibly divides some thin structures into multiple small parts, such as chair legs. In comparison, our method is able to extract more concise and precise results.
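The symmetric Chamfer distance used for this evaluation can be sketched as follows (a brute-force NumPy version for small point sets; production code would use a KD-tree or batched GPU kernels):

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance: mean squared nearest-neighbour
    distance in both directions between point sets P and Q."""
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)  # pairwise
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# The distance of a point set to itself is exactly zero.
P = np.random.default_rng(0).standard_normal((64, 3))
assert chamfer_distance(P, P) == 0.0
```
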
4.2. Shape cosegmentation
Table 2. Co-segmentation quality (mIOU, %) on three categories of the ShapeNet part dataset.

Method | Airplane | Chair | Table
VP (Tulsiani et al., 2017) | 37.6 | 64.7 | 62.1
SQ (Paschalidou et al., 2019) | 48.9 | 65.6 | 77.7
HA (Sun et al., 2019) | 55.6 | 80.4 | 67.4
BAE-Net (Chen et al., 2019) | 61.1 | 65.5 | 87.0
BSP-Net (Chen et al., 2020) | 74.5 | 82.1 | 90.3
Ours | 64.2 | 82.0 | 89.2
Though mainly designed for cuboid abstraction, our network also supports point cloud segmentation with part correspondences within an object category. In this section, we compare a number of shape decomposition networks on the task of unsupervised point cloud segmentation. In addition to the VP and HA approaches, we also compare with three state-of-the-art shape decomposition networks: Super Quadrics (SQ) (Paschalidou et al., 2019), Branched Auto-Encoders (BAE) (Chen et al., 2019), and BSP-Net (BSP) (Chen et al., 2020). SQ uses superquadrics as geometric primitives, which have more flexibility than the cuboid representation. BAE, on the other hand, uses the distance field as the base representation and generates a complex object by jointly predicting multiple relatively simple implicit fields. BSP is based on the binary space-partitioning algorithm, which cuts the 3D space with multiple planes to depict the surface of a shape; the polyhedra composed of multiple planes can be used as primitives to represent object shapes. Note that a variety of training methods are introduced in BAE; we compare with its unsupervised baseline.
We conducted a quantitative comparison on three categories of shapes: airplane (2690), chair (3758), and table (5271) in the ShapeNet part dataset (Yi et al., 2016). Since we perform structural shape modeling, following (Chen et al., 2020), we reduce the original semantic annotation of the table category from (top, leg, support) to (top, leg) by merging the ‘support’ label into ‘leg’. We adopt the mean per-label Intersection Over Union (mIOU) as the evaluation criterion for the segmentation task. Since the segmentation branch of our network does not refer to any semantics, we assign semantic annotations to each geometric primitive as follows. We first randomly take out 20% of the shapes in the dataset, count the number of points belonging to each ground-truth annotation for each primitive in the segmentation branch, and finally assign the ground-truth annotation with the highest number of points as the label for that primitive. Afterward, these labels are transferred to the whole test set.
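This majority-vote label transfer can be sketched as follows (a NumPy illustration with hypothetical names; in practice the vote is accumulated over the held-out 20% of shapes):

```python
import numpy as np

def transfer_labels(primitive_ids, gt_labels, n_primitives, n_classes):
    """Map each unsupervised primitive to the ground-truth semantic label
    it overlaps most, by counting per-primitive label votes."""
    votes = np.zeros((n_primitives, n_classes), int)
    np.add.at(votes, (primitive_ids, gt_labels), 1)  # unbuffered accumulation
    return votes.argmax(axis=1)        # per-primitive semantic label

# Primitive 0 mostly covers class 2; primitive 1 mostly covers class 0.
prim = np.array([0, 0, 0, 1, 1, 1])
gt   = np.array([2, 2, 1, 0, 0, 0])
mapping = transfer_labels(prim, gt, n_primitives=2, n_classes=3)
assert mapping.tolist() == [2, 0]
```
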
The quantitative results on the co-segmentation task are compared in Table 2. Our method achieves the best results among the cuboid-based approaches (VP, SQ, and HA) and ranks second among all methods, demonstrating that our method generates segmentation results with higher semantic consistency. Fig. 7 shows the segmentation results of our network. In addition to the above three categories, we also show segmentation results on the animal category. Notice that our point cloud segmentation results are consistent with the shape abstraction, which proves the effectiveness of our joint learning. Our method is able to subtly partition fine shape details, such as the engines of airplanes and the connection structure between the seat and the backrest. Moreover, although we adopt the cuboid representation, our network handles non-cubic structures well, for example, the fuselage of airplanes, the backrest and cushions of a sofa, and the animals. In addition, our segmentation results exhibit strong semantic consistency by using the same cuboid to express the same structure in different instances.
4.3. Ablation study and analysis
Our key idea is to learn the cuboid abstraction and shape segmentation jointly for mutual compensation. To this end, we design a two-branch network with several loss functions to promote compatibility between the branches. In this section, we disentangle our network and analyze the effect of each loss function through a group of ablation experiments.
Role of the segmentation module.
To train a shape abstraction model in an unsupervised manner, the most important thing is to assign parts of the input point cloud to the corresponding primitives. A reasonable part allocation facilitates the learning process. We explicitly learn the allocation weights W in Eq. (3) by the segmentation branch. To verify its effectiveness for unsupervised learning, we first remove the segmentation branch and directly assign each point to its closest cuboid, i.e., W_ij = 1 when c_j is the closest cuboid of p_i and W_ij = 0 otherwise. This variant is denoted as P2C-Dis. Furthermore, we also train a model (Chamfer-Dis) without the segmentation branch but using the bidirectional Chamfer distance as the reconstruction loss, as most previous unsupervised methods do. For a fair comparison, we train the model with the segmentation branch using the reconstruction loss only, denoted as P2C-Seg.
We compare the reconstruction results of the three variants in Table 3, which shows that P2C-Seg outperforms the other two variants on all three categories. In Fig. 8, we visualize the part allocation results of the three methods. For P2C-Seg, we directly visualize the segmentation results predicted by the segmentation branch. For Chamfer-Dis and P2C-Dis, as they assign a point to its closest cuboid primitive when computing the reconstruction loss, we attach the label of the nearest cuboid to each point. The part allocation results of P2C-Dis and Chamfer-Dis are more scattered than those of P2C-Seg, leading to stacked cuboids. In addition, since it is difficult to adjust the hard assignment based on geometric distance, the P2C-Dis and Chamfer-Dis models usually get stuck in local minima. For example, in the right column of Fig. 8, the four chair legs are fitted with one large cuboid rather than four small ones. In contrast, the part allocations of P2C-Seg are more explicit, so more compact abstraction results are achieved.
Table 3. Reconstruction quality (Chamfer distance) of the three variants.

Category | P2C-Dis | Chamfer-Dis | P2C-Seg
Airplane | 4.836 | 0.286 | 0.237
Chair | 3.356 | 0.834 | 0.436
Table | 3.335 | 0.992 | 0.846
Point-to-cuboid distance.
The implementation of the point-to-cuboid distance D(p_i, c_j) in Eq. (4) greatly affects the network optimization process. We make two design choices, on the projection manner and on random sampling, to prevent model degeneration and to enhance robustness to noise in the input shapes. Next, we disentangle these two designs and analyze their impacts respectively.
An intuitive way of computing D(p_i, c_j) is to project a point to its closest cuboid surface. However, as shown in Fig. 9, there exists a degenerate solution since surface normal consistency is ignored: this distance can be sufficiently small as long as there is at least one cuboid surface near each point. In contrast, projection according to the normal direction not only ensures that the cuboid surface to be optimized has a surface normal similar to that of the point but also helps this normal-similar cuboid surface become the closest surface as the optimization proceeds.
While the projection manner selects which cuboid surface to use for the point-to-cuboid distance, we do not directly use the shortest distance from a point to that surface. Instead, we randomly sample a point p′_i along the point normal and project it onto the selected cuboid surface to find its closest point q_i. The distance ‖p′_i − q_i‖ vanishes only when p_i lies on a surface that has the same normal direction as p_i. This enforces the network to emphasize normal similarity as well as geometric distance without increasing the computational complexity. In contrast, the bidirectional Chamfer distance used in previous methods requires double the calculation and sampling on the cuboid surfaces.
Moreover, this random sampling also improves the robustness to noisy point clouds. We add Gaussian noise of varying magnitude to the input point clouds and compare the reconstruction quality obtained with different point-to-cuboid distances in Table 4. The models trained with minimum-distance projection degenerate, while the models with normal-similar projection achieve satisfactory reconstruction accuracy. The random sampling drives the network to pay more attention to surface orientations, so the normal consistency is generally higher than without point sampling along the normal direction. In addition, the models trained with normal sampling show better robustness to noise: as the noise level increases, the reconstruction quality with normal sampling during training remains generally better than without it.

Table 4. Reconstruction quality (CD / normal consistency) under different noise levels.

Noise   Min-distance proj.   Normal-similar projection
0.00    56.042/0.451         0.386/0.758   0.389/0.762   0.399/0.763
0.01    57.371/0.448         0.463/0.732   0.480/0.730   0.442/0.736
0.02    57.450/0.465         0.626/0.667   0.598/0.679   0.572/0.671
0.03    58.734/0.461         0.909/0.641   0.727/0.645   0.754/0.652
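For reference, the symmetric Chamfer distance used as the reconstruction metric in these tables can be computed as below; this is a standard formulation, and the exact normalization used in our evaluation may differ.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor squared distance in both directions."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

In practice one direction compares the input cloud against points sampled on the cuboid surfaces; the single-direction point-to-cuboid formulation above avoids the second sampling pass.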
Compactness loss.
The compactness loss is designed to penalize redundant cuboids. We vary its weight exponentially and analyze the reconstruction results. In Table 5, we report the average number of cuboids used in the abstraction results over all instances, the reconstruction quality (CD), and the mIOU for part segmentation. As the weight increases, the average cuboid number gradually decreases, as analyzed in Sec. 3.4.2, and the CD increases because fewer cuboids are available to represent the shape. Fig. 10 shows the abstraction results of three shapes under different weights. When the compactness loss is disabled (b), the network is optimized freely without any constraint on the number of cuboids, leading to over-disassembled shapes with redundant cuboids. From (a) to (d), the increasing weight leads to a more concise structural representation. When the weight is too large, the network becomes too stingy with cuboids and some structure is lost, which is also reflected in the CD and mIOU in Table 5. On the other hand, the abstracted shapes are structurally consistent among multiple instances of the category under different weight settings: in each column, the common parts of different instances, such as the back, the seat, and the four legs, are consistently represented.
Table 5. Abstraction results under different compactness-loss weights.

Weight          0.00     0.05     0.10    0.20
Avg. #cuboids   15.486   11.144   9.753   5.544
CD              0.381    0.385    0.399   1.452
mIOU            81.2     82.1     82.0    72.5
Choice of the cuboid number.
In our experiments, we set the maximum number of cuboids empirically for each category. To evaluate the sensitivity of our method to this number, we vary it while fixing all the other hyperparameters. In Table 6, we report the average Chamfer distance and the average number of used cuboids in the abstraction results under different maximum cuboid numbers for three categories. As expected, the CD decreases as the maximum number increases, since more cuboids can be used to handle diverse structures and better fit fine shapes. In fact, decreasing the maximum cuboid number with a fixed compactness weight is analogous to increasing the weight with a fixed cuboid number when training our network, and abstraction results similar to Fig. 10 can be obtained.
Table 6. CD and average number of used cuboids under different maximum cuboid numbers.

Category   Metric          8       12      16      20       24
Airplane   CD              0.849   0.484   0.394   0.329    0.285
           Avg. #cuboids   4.171   5.663   9.801   11.801   13.720
Chair      CD              1.381   0.501   0.399   0.385    0.366
           Avg. #cuboids   5.976   7.150   9.753   10.130   12.158
Table      CD              1.170   1.062   0.826   0.776    0.657
           Avg. #cuboids   5.636   7.328   8.524   9.747    12.573
Robustness on sparse point clouds.
To verify the robustness of our framework to the density of the input point clouds, we train our model with different numbers of points, including 256, 1024, and 4096, on the chair category. In Fig. 11, we show the abstraction and segmentation results of the same shape under different point cloud densities. Our method produces reasonable and consistent structural representations for the same shape across these densities.
4.4. Applications
Based on our network architecture, our method supports multiple applications: generating and interpolating cuboid representations, as well as structural shape clustering.
Shape generation
Benefiting from the VAE architecture, our network can accomplish the shape generation task by sampling the latent code from a standard normal distribution, a capability neglected in previous unsupervised shape abstraction methods such as VP and HA. Fig. 12 shows a group of generated shapes, demonstrating the capability of our method to generate structurally diverse and plausible 3D shapes with the cuboid representation.
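Generation then amounts to decoding latent codes drawn from the prior. The sketch below uses a hypothetical random-weight linear decoder as a stand-in for our trained network, and the 7-parameter cuboid encoding (center, half extents, existence logit) is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, num_cuboids = 128, 16

# Hypothetical decoder weights; a stand-in for the trained VAE decoder.
W = rng.normal(0.0, 0.02, size=(latent_dim, num_cuboids * 7))

def decode(z):
    """Map latent codes (B, latent_dim) to per-cuboid parameters:
    center (3) + half extents (3) + existence logit (1) = 7 values."""
    return (z @ W).reshape(len(z), num_cuboids, 7)

# Generation: sample latent codes from the standard normal prior N(0, I).
z = rng.standard_normal((4, latent_dim))
cuboids = decode(z)   # (4, 16, 7): four generated cuboid-based shapes
```

Cuboids whose existence logit falls below a threshold would be dropped, so sampled shapes can differ in structure as well as geometry.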
Shape interpolation
Shape interpolation generates a smooth transition between two existing shapes. Its effect depends on whether the network has learned a smooth high-dimensional manifold in the latent space. We evaluate our network by interpolating between pairs of 3D shapes to demonstrate that it learns such a manifold. As illustrated in Fig. 13, our shape abstraction network produces a smooth transition between two shapes with reasonable and realistic 3D structures; for example, the backrest gradually becomes smaller and the seat gets progressively thinner from left to right.
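The interpolation itself is straightforward once a smooth latent space is learned. A minimal sketch using plain linear interpolation between two codes (spherical interpolation is an equally common choice, not used here):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=8):
    """Linearly interpolate between two latent codes (D,).
    Returns (steps, D); decoding each row yields the shape transition."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a[None] + t * z_b[None]
```

Feeding each intermediate code through the decoder produces the kind of gradual structural transition shown in Fig. 13.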
Structural shape clustering
Though object geometries can vary dramatically within the same category, they often follow a specific structural design. Based on the learned structured cuboid representation, our method supports clustering 3D shapes according to their common structure. We use the existence indicator vector of the cuboids learned by our abstraction network to represent the object structure; shapes in which the same set of cuboids appears are considered to have the same structure. Fig. 14 shows four groups of structural clustering results for the table category. In each cluster, even though the geometric details vary greatly, the 3D shapes share a common structure.
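The clustering step described above follows directly from the existence indicator vectors; a minimal sketch that groups shape indices whose binary vectors are identical:

```python
from collections import defaultdict
import numpy as np

def cluster_by_structure(existence_vectors):
    """Group shapes whose binary cuboid-existence vectors are identical:
    shapes using exactly the same set of cuboids share one structure.
    Returns a list of clusters, each a list of shape indices."""
    clusters = defaultdict(list)
    for i, v in enumerate(existence_vectors):
        clusters[tuple(np.asarray(v, dtype=int))].append(i)
    return list(clusters.values())
```

Because the indicator vectors abstract away all geometric detail, two tables with very different leg shapes but the same active cuboids land in the same cluster, matching the behavior shown in Fig. 14.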
4.5. Failure cases, limitation and future works
The robustness and effectiveness of our unsupervised shape abstraction method have been demonstrated by extensive experiments. Nevertheless, it has some limitations and fails in certain special cases. Since our method takes into account the structure consistency between shape instances of the same category, instances with very unique structures may not be well reconstructed, such as the aircraft in Fig. 15 (a). Due to the uniform sampling of point clouds in 3D space, the points sampled on fine structures are too scarce to provide sufficient geometric information, e.g., the table legs in Fig. 15 (b). Another failure case is caused by a small maximum cuboid number: for shapes with an excessive number of small parts, our method fails to precisely recover all the parts within the limited cuboid budget. For the example in Fig. 15 (c), the six thin slats are grouped into two parts in order to keep semantic consistency with other chairs.
Our framework can be improved in several directions in the future. Currently, our shape abstraction network is trained separately for each object category. One future direction is to encode multiple classes of shapes in one network simultaneously, or to learn transferable features across categories. Another direction is to integrate multiple kinds of geometric primitives to represent objects, or primitives with stronger representational capability, such as polyhedra (Deng et al., 2020) and star domains (Kawana et al., 2020); however, there should be a trade-off between representational capability and structural simplicity. Moreover, although the manually annotated dataset (Mo et al., 2019b) already contains rich relationships between parts, unsupervised relationship discovery among parts is still a challenging task that can be further investigated.
5. Conclusions
In this paper, we introduce a method for unsupervised learning of a cuboid-based representation for shape abstraction from point clouds. We take full advantage of shape co-segmentation to extract the common structure of an object category, mitigating the multiplicity and ambiguity of sparse point clouds. We demonstrate the superiority of our method in preserving structural and semantic consistency for cuboid shape abstraction and point cloud segmentation. Our generative network is also versatile in shape generation, point cloud segmentation, shape interpolation, and structural shape clustering.
Acknowledgements.
We thank the anonymous reviewers for their constructive comments. We are grateful to Dr. Xin Tong for his insightful suggestions on this project. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61632006 and Grant 62076230; in part by the Fundamental Research Funds for the Central Universities under Grant WK3490000003; and in part by Microsoft Research Asia.
References

Parametric correspondence and chamfer matching: two new techniques for image matching. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). Cited by: §4.1.
ShapeNet: an Information-Rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR]. Cited by: §4.1.

BSP-Net: generating compact meshes via binary space partitioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2, Table 2.
BAE-NET: branched autoencoder for shape co-segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Cited by: §2, §4.2, Table 2.
CvxNet: learnable convex decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.5.
Composite shape modeling via latent space factorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Cited by: §2.
Learning generative models of shape handles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
SDM-NET: deep generative network for structured deformable mesh. ACM Trans. Graph. (SIGGRAPH ASIA) 38 (6), pp. 243:1–243:15. Cited by: §2.
Neural star domain as primitive representation. In Advances in Neural Information Processing Systems. Cited by: §4.5.
Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR). Cited by: §3, §3.4.4.
Learning part generation and assembly for structure-aware shape synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Cited by: §2.
GRASS: generative recursive autoencoders for shape structures. ACM Trans. Graph. (SIGGRAPH) 36 (4), pp. 52:1–52:14. Cited by: §1, §2.
SEG-MAT: 3D shape segmentation using medial axis transform. IEEE Transactions on Visualization and Computer Graphics. Cited by: §2.
Learning implicit functions for topology-varying dense 3D shape correspondence. In Advances in Neural Information Processing Systems. Cited by: §2.
StructureNet: hierarchical graph networks for 3D shape generation. ACM Trans. Graph. (SIGGRAPH ASIA) 38 (6), pp. 242:1–242:19. Cited by: §1, §2.
PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2, §4.5.
Im2Struct: recovering 3D shape structure from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
Superquadrics revisited: learning 3D shape parsing beyond cuboids. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2, §4.2, Table 2.

PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. Cited by: §3.4.5.
PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.1.
CompoNet: learning to generate the unseen by part synthesis and composition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Cited by: §2.
Deep parametric shape predictions using distance fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.
Learning adaptive hierarchical cuboid abstractions of 3D shape collections. ACM Trans. Graph. (SIGGRAPH ASIA) 38 (6), pp. 241:1–241:13. Cited by: §1, §2, §3, §4.1, Figure 6, Table 1, Table 2.
Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2, §3, §4.1, Figure 6, Table 1, Table 2.
Global-to-local generative model for 3D shapes. ACM Trans. Graph. (SIGGRAPH ASIA) 37 (6), pp. 214:1–214:10. Cited by: §2.
Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 38 (5), pp. 146:1–146:12. Cited by: §3.1.
PQ-NET: a generative part seq2seq network for 3D shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
SAGNet: structure-aware generative network for 3D shape modeling. ACM Trans. Graph. (SIGGRAPH) 38 (4), pp. 91:1–91:14. Cited by: §2.
DSM-Net: disentangled structured mesh net for controllable generation of fine geometry. arXiv:2008.05440. Cited by: §1, §2.
A scalable active framework for region annotation in 3D shape collections. ACM Trans. Graph. (SIGGRAPH ASIA) 35 (6), pp. 210:1–210:12. Cited by: §1, §2, §4.2.
PartNet: a recursive part decomposition network for fine-grained and hierarchical shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
3D-PRNN: generating shape primitives with recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Cited by: §2.