3D-PRNN: Generating Shape Primitives with Recurrent Neural Networks

Applications in robotics, digital content creation, and visualization demand structured, abstract representations of the 3D world built from limited sensor data. Inspired by the human tendency to perceive 3D shapes as collections of simple parts, we explore such an abstract shape representation based on primitives. Given a single depth image of an object, we present 3D-PRNN, a generative recurrent neural network that synthesizes multiple plausible shapes composed of a set of primitives. Our generative model encodes symmetry characteristics of common man-made objects, preserves long-range structural coherence, and describes objects of varying complexity with a compact representation. We also propose a method based on Gaussian Fields to generate a large-scale dataset of primitive-based shape representations to train our network. We evaluate our approach on a wide range of examples and show that it outperforms nearest-neighbor-based shape retrieval methods and is on par with voxel-based generative models while using a significantly reduced parameter space.





1 Introduction

Many robotics and graphics applications require 3D interpretations of sensory data. For example, picking up a cup, moving a chair, predicting whether a stack of blocks will fall, or looking for keys on a messy desk all rely on at least a vague idea of object position, shape, contact, and connectedness. A major challenge is how to represent 3D object geometry in a way that (1) can be predicted from noisy or partial observations and (2) is useful for reasoning about contact, support, extent, and so on. Recent efforts often focus on voxelized volumetric representations (e.g., [44, 43, 14, 9]). Instead, we propose to represent objects with 3D primitives (oriented 3D rectangles, i.e., cuboids). Compared to voxels, primitives are much more compact: for example, 45-D for 5 primitives parameterized by scale-rotation-translation vs. 32,768-D for a 32x32x32 voxel grid. Primitives are also holistic: representing an object with a few parts greatly simplifies reasoning about stability, connectedness, and other important properties. Primitive-based 3D object representations have long been popular in psychology (e.g., "geons" by Biederman [3]) and interactive graphics (e.g., "Teddy" [19]), but they are less commonly employed in modern computer vision due to the challenges of learning and predicting models that consist of an arbitrary number of parameterized components.
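The compactness comparison above is simple arithmetic; as a sanity check (a throwaway Python snippet, with variable names of our own choosing):

```python
# Compactness of the two representations compared in the text.
# Each cuboid primitive has 3 scale + 3 translation + 3 rotation parameters.
params_per_primitive = 3 + 3 + 3
n_primitives = 5
primitive_dims = n_primitives * params_per_primitive  # 45-D

# One occupancy value per cell of a 32x32x32 voxel grid.
voxel_dims = 32 ** 3  # 32,768-D

print(primitive_dims, voxel_dims)  # 45 32768
```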

Figure 1: Step-by-step primitive-based shape generation by 3D-PRNN. Given a single depth image, the network sequentially predicts sets of primitives that form the shape. At each step we randomly sample one primitive from the predicted set and generate the next set of primitives conditioned on the current sample.
Figure 2: 3D-PRNN overview. We illustrate the method on the task of single-depth shape completion. The network first encodes the input depth image into a feature vector, which is then sent to the "recurrent generator" consisting of stacks of Long Short-Term Memory (LSTM) units and a Mixture Density Network (MDN). At each time step, the network predicts a set of primitives conditioned on both the depth feature and the previously sampled primitive. The final reconstruction result and ground truth are shown on the right.

Our goal is to learn 3D primitive representations of objects from unannotated 3D meshes. We follow an encoder-decoder strategy, inspired by recent work [15, 40], using a recurrent neural network (RNN) to encode an implicit shape representation and then sequentially generate primitives to approximate the shape, as shown in Fig. 1. One challenge in training such a primitive generation network is acquiring ground truth data for primitive-based shape representations. To address this challenge, we propose an efficient method based on Gaussian Fields and energy minimization [6] to iteratively parse shapes into primitive components, optimizing a differentiable loss function with robust techniques (L-BFGS [47]). We use this (unsupervised) optimization process to create the primitive ground truth, solving for a set of primitives that approximates each 3D mesh in a collection. The RNN can then be trained to generate new primitive-based shapes that are representative of an object class's distribution, or to complete an object's shape given a partial observation such as a depth image or point cloud.

To model shapes, we propose 3D-PRNN, an RNN-based generative model that predicts context-sensitive sequences of primitives in object-centric coordinates, as shown in Fig. 2. To predict shape from depth, the network is trained jointly on single depth images and the sequences of primitive configurations (shape, translation, and rotation) that form the complete shapes. At test time, the network takes a depth map as input and sequentially predicts primitives (ending with a stop signal) to reconstruct the shape. Our generative RNN architecture is based on a Long Short-Term Memory (LSTM) and a Mixture Density Network (MDN).

We evaluate our generative model by comparing with baselines and state-of-the-art methods. We show that, even though our method has fewer degrees of freedom in its representation, it achieves accuracy comparable to voxel-based reconstruction methods. We also show that encoding additional symmetry and rotation-axis constraints in our network significantly boosts performance.

Our main contributions are:

  • We propose 3D-PRNN: a generative recurrent neural network that reconstructs 3D shapes as sequences of primitives given a single depth image.

  • We propose an efficient method to fit primitives to point clouds based on Gaussian fields and energy minimization. Our primitive representation provides sufficient training samples for 3D-PRNN to learn 3D reconstruction.

2 Related Work

Primitive-based Shape Modeling: Biederman, in the early 1980s, popularized the idea of representing shapes as collections of components or primitives called "geons" [3], and early computer vision algorithms attempted to recover object-centered 3D volumetric primitives from single images [10]. In computer-aided design, primitive-based shape representations are used for 3D scene sketching [46, 33] and shape completion from point clouds [35, 26, 34]. When scans contain canonical parts such as planes or boxes and an efficient solution for large data is required, primitives are used to reconstruct urban and architectural scenes [7, 25, 5, 37]. More recently, compact parametric representations in the form of template objects [42] and sets of primitives [39] have been introduced. These representations, however, require non-trivial effort to accommodate a variable number of configurations within the object class they are trained for, mainly because their single feed-forward design implicitly forces the prediction of a fixed number of variables at once.

Figure 3: Sample primitive fitting result. We show our primitive fitting results on chairs, tables and sofas. We overlay our fitted primitives on the sampled 3D point clouds of each shape.
Figure 4: Failure cases. Main causes are: shape details too complex to be represented by primitive blocks (left); the smoothing property of Gaussian force fields, which is poorly suited to describing small hollow shapes (middle); and small clusters of points that are easily missed by our randomized search scheme (middle and right).

Object 3D shape reconstruction can be attempted from an RGB image [44, 14, 1, 9] or a depth image [43, 31, 12]. Recently proposed representations and prediction approaches for 3D data in the context of prediction from sensory input have mainly focused either on part- and object-based retrieval from large repositories [31, 1, 27, 21] or on voxelized volumetric representations [44, 43, 14, 9]. Better model fitting can include part deformation [8] and symmetry [24]. Wu et al. [43] present preliminary results on automatic shape completion from depth by classifying hidden voxels with a deep network. Wu et al. [42] reconstruct shapes based on predicted skeletons. Unlike mesh-based or voxel-based shape reconstruction, our method predicts shapes as aggregations of primitives, which has the benefit of lower computational and storage cost.

Generative Models with RNNs: Graves [15] uses Long Short-Term Memory recurrent neural networks to generate complex sequences of text and online handwriting. Gregor et al. [16] combine an LSTM with a variational auto-encoder, called the Deep Recurrent Attentive Writer (DRAW) architecture, for image generation. The DRAW architecture is a pair of RNNs: an encoder network that compresses the real images presented during training, and a decoder that reconstitutes images after receiving codes. Rezende et al. [20] extend DRAW to learn generative models of 3D structures and recover this structure from 2D images via probabilistic inference. Our 3D-PRNN, which sequentially generates primitives, is inspired by Graves' approach of sequentially generating parameterized handwriting strokes and by the PixelRNN approach [40] of modeling natural images as sequentially generated pixels. To produce parameterized 3D primitives (oriented cuboids), we customize the RNN to encode explicit geometric constraints of symmetry and rotation. For example, separately predicting whether a primitive should rotate along each axis and by how much improves results over simply predicting rotation values, since many objects consist of several (unrotated) cuboids.

3 Fitting Primitives from Point Clouds

One challenge in training our 3D-PRNN primitive generation network is the lack of large-scale ground truth data for primitive-based shape reconstruction. We propose an efficient method to generate such data. Given a point cloud representation of a shape, our approach finds the most plausible primitives sequentially; for example, given a table, the algorithm might first identify the primitive that fits the top surface and then the legs in turn. We use rectangular cuboids as primitives, which provide a plausible abstraction for most man-made objects. Our method offers a fast parsing solution that decomposes shapes of varying complexity into sets of such primitives.

3.1 Primitive Fitness Energy

We formulate the successive fitting of primitives as an energy minimization scheme. While primitive fitting at each step resembles the method of Iterative Closest Point (ICP) [2], we have additional challenges. ICP ensures accurate registration when provided with a good initialization, but in our case we have no prior knowledge about the number and the rough shape of the primitives. Moreover, we need to solve the more challenging partial matching problem since each primitive matches only part of the shape, which we do not know in advance.

We represent the shape of each primitive with scale parameters S = (s_x, s_y, s_z), which denote the scale of a unit cube along three orthogonal axes. The position and orientation of the primitive are represented by a translation t = (t_x, t_y, t_z) and Euler angles θ = (θ_x, θ_y, θ_z), respectively; thus each primitive is parameterized by x = {S, t, θ}. Furthermore, we assume a fixed sampling of the unit cube into a set of points P. Given a point cloud representation of a shape, Q, our goal is to find the set of primitives that best fit the shape. We employ the ideas of Gaussian Force Fields [6] and the Truncated Signed Distance Function (TSDF) [29] to formulate the following continuously differentiable energy function, which is convex in a large neighborhood of the parameters:

E(x) = -\sum_{p \in P} \sum_{q \in Q} r(x) \exp\left( -\frac{\min\big(\lVert T(p; x) - q \rVert, \tau\big)^2}{\sigma^2} \right), \quad T(p; x) = R(\theta)\,\mathrm{diag}(S)\,p + t \qquad (1)

where R(θ) is the rotation matrix, τ is the truncation parameter (fixed in our experiments), and r(x) denotes the volumetric sampling ratio, calculated as the volume of the primitive over its number of sampled points |P|; r(x) helps avoid local minima that result in a too small or too large primitive. Our formulation represents the error as a smooth sum of Gaussian kernels, where far-away point pairs are penalized less to account for partial matching.
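As a concrete illustration of the fitness energy, the following NumPy sketch evaluates one cuboid against a point cloud. The Euler-angle composition, the min-style truncation, and the parameter defaults are our assumptions, not the paper's exact settings:

```python
import numpy as np

def euler_to_rot(theta):
    """Rotation matrix from Euler angles, composed as Rz @ Ry @ Rx (assumed)."""
    ax, ay, az = theta
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def primitive_energy(S, t, theta, P, Q, sigma=0.1, tau=0.3):
    """Gaussian-field fitness energy between a scaled, rotated, translated
    unit-cube sampling P and a shape point cloud Q. More negative = better fit."""
    T = (euler_to_rot(theta) @ (P * S).T).T + t      # T(p; x) for all p
    d = np.linalg.norm(T[:, None, :] - Q[None, :, :], axis=-1)
    d = np.minimum(d, tau)                           # TSDF-style truncation
    r = np.prod(S) / len(P)                          # volumetric sampling ratio
    return -np.sum(r * np.exp(-d**2 / sigma**2))
```

Moving the primitive away from the points drives the energy toward zero, while a well-aligned primitive yields a strongly negative value.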

The energy function in Eq. 1 is sensitive to the parameter σ: a larger σ encourages fitting of large primitives while allowing larger distances between matched point pairs. In order to prefer tighter-fitting primitives, we introduce the concept of a negative shape, N, represented as a set of points sampled in the non-occupied space inside the bounding box of a shape. We update our energy function as:

E'(x) = E_Q(x) - \lambda E_N(x) \qquad (2)

where E_Q(x) is the fitting energy between the shape and the primitive and E_N(x) is the fitting energy between the negative shape and the primitive. Given the point samples, both E_Q and E_N are computed as in Eq. 1, and λ denotes the relative weighting of the two terms.

3.2 Optimization

Given the energy formulation described in the previous section, we perform primitive fitting sequentially. During each iteration, we randomly initialize a number of candidate primitives, optimize Eq. 2 for each of them, and add the best-fitting primitive to our primitive collection. We then remove the points in Q that are fit by the selected primitive and iterate, stopping once all the points in Q are fit by some primitive. We optimize Eq. 2 in an alternating manner: we first fix θ and solve for S and t, then fix S and t and solve for θ. In our experiments this optimization converges within a few iterations, and we use the L-BFGS toolbox [32] at each optimization step. We summarize this process with the pseudo-code in Alg. 1.

1: Given shape point cloud Q and empty primitive set X;
2: k ← 0;
3: while Q ≠ ∅ and k < maxPrimNum do
4:     E_best ← Inf;
5:     for i = 1 … maxRandNum do
6:         randomly initialize x = (S, t, θ);
7:         while not converged and iter < maxIter do
8:             fix θ, solve for S, t by Eq. 2;
9:             fix S, t, update θ by Eq. 2;
10:            calculate E'(x) by Eq. 2;
11:            if E'(x) < E_best then
12:                E_best ← E'(x); x_best ← x;
13:            iter ← iter + 1;
14:     add x_best to X; k ← k + 1;
15:     remove fitted points from Q and add them to the non-occupied space N;
16: return X
Algorithm 1 Primitive fitting
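The outer loop of Alg. 1 can be sketched as follows. This is a simplified stand-in: the alternating S, t / θ updates solved with L-BFGS in the paper are collapsed into a generic random local search, and `energy` and `covered` are caller-supplied functions (an interface of our own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_primitives(Q, energy, covered, max_prim=10, n_restarts=5, n_iters=20):
    """Greedy sequential fitting in the spirit of Alg. 1: repeatedly keep the
    best of several randomly initialized primitives, then drop the points it
    explains. x is a 9-vector (S, t, theta); lower energy(x, Q) is better;
    covered(x, Q) returns a boolean mask of points fit by x."""
    X = []
    Q = Q.copy()
    while len(Q) > 0 and len(X) < max_prim:
        best_x, best_e = None, np.inf
        for _ in range(n_restarts):
            x = rng.uniform(-1.0, 1.0, size=9)             # random initialization
            e = energy(x, Q)
            for _ in range(n_iters):                       # stand-in for the
                cand = x + rng.normal(scale=0.05, size=9)  # L-BFGS refinement
                ce = energy(cand, Q)
                if ce < e:
                    x, e = cand, ce
            if e < best_e:
                best_x, best_e = x, e
        X.append(best_x)
        Q = Q[~covered(best_x, Q)]                         # remove fitted points
    return X
```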

Simplification with symmetry. We utilize the symmetry characteristics of man-made shapes to further speed up the primitive parsing procedure. We use axis-aligned 3D objects where symmetric objects have a common global symmetry plane. We compare the geometry on the two sides of this plane to decide whether an object is symmetric or not. Once we obtain a primitive that lies on one side of the symmetry plane, we automatically generate the symmetric primitive on the other side of the plane.
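The symmetric-primitive generation step can be made concrete as below. We assume an Rz·Ry·Rx Euler composition and a symmetry plane through the origin; under a reflection, the rotation angle about the mirror normal is preserved while the other two angles flip sign:

```python
import numpy as np

def euler_to_rot(theta):
    """Rotation matrix from Euler angles, composed as Rz @ Ry @ Rx (assumed)."""
    ax, ay, az = theta
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def mirror_primitive(S, t, theta, axis=0):
    """Reflect a cuboid primitive across the plane through the origin whose
    normal is the given axis. Scale is unchanged; the translation component
    along the axis flips sign, as do the Euler angles about the other two axes."""
    t2 = t.copy()
    t2[axis] = -t2[axis]
    th2 = -theta
    th2[axis] = theta[axis]
    return S.copy(), t2, th2
```

The angle rule follows from M R M = R', where M is the reflection matrix: conjugating each elementary rotation by M negates the angles about the non-normal axes.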

Refinement. At each step, we first fit primitives with a relatively large Gaussian field (larger σ in Eq. 1) for fast convergence and easier optimization, and then refine the fit in a finer energy space (smaller σ) to match the primitive to the detailed shape of the object. While our random search scheme enables fast parsing, errors may accumulate in the final set of primitives. To avoid such problems, we perform a post-refinement step: we refine the parameters of a single primitive while fixing all the others, using the parameters obtained from the initial fitting as initialization. We define the energy terms in Eq. 2 with respect to the points fit by the primitive under refinement and the points not yet fit by any primitive. We note that this sequential refinement is similar in spirit to backpropagation used to train neural networks. In our experiments, we perform the refinement each time we fit 3 new primitives.

4 3D-PRNN: 3D Primitive Recurrent Neural Networks

Generating primitive-based 3D shapes is a challenging task due to the complex multi-modal distribution of shapes and the unconstrained number of primitives required to model such complex shapes. We propose 3D-PRNN, a generative recurrent neural network to accomplish this task. 3D-PRNN can be trained to generate novel shapes both randomly and by conditioning on partial shape observations such as a single depth map.

Figure 5: Detailed architectures of (a) the depth map encoder and (b) the primitive recurrent generator unit in 3D-PRNN. See the architecture descriptions in Section 4.1.

4.1 Network Architecture

An overview of the 3D-PRNN network is illustrated in Fig. 2. The network takes as input a single depth image and sequentially predicts primitives to form a 3D shape. For each primitive, the network predicts its shape (height, length, width), position (i.e., translation), and orientation (i.e., rotation). Additionally, at each step, a binary end-of-generation signal is predicted, indicating that no more primitives should be generated.

Depth map encoder. Each input depth map is first resized to a fixed dimension with values normalized to a fixed range (we set background regions to 0). The map is passed to an encoder consisting of stacked convolutional and LeakyReLU [28] layers, as shown in Fig. 5(a): the first convolutional layer is followed by a LeakyReLU with a small slope in the negative part; the second convolutional layer is followed by the same LeakyReLU setting and a max pooling layer; the third convolutional layer is likewise followed by LeakyReLU and max pooling. Two fully-connected layers follow. The output feature vector is then sent to the recurrent generator to predict a sequence of primitives.

Recurrent generator. We apply Long Short-Term Memory (LSTM) units inside the recurrent generator, which are known to alleviate the vanishing and exploding gradient problems [30] when training RNNs. The architectural design is shown in Fig. 5(b). The prediction unit consists of several recurrently connected hidden layers (a stack depth we found sufficient to model the complex primitive distributions) that encode both the depth feature and the previously predicted primitive, and then compute the output vector y_t, which parameterizes a predictive distribution over the next possible primitive. The hidden layer activations are computed by iterating over the following equations for time steps t = 1, …, T and layers n = 1, …, N:

i_t^n = \sigma(W_i^n [x_t^n; h_{t-1}^n])
f_t^n = \sigma(W_f^n [x_t^n; h_{t-1}^n])
g_t^n = \tanh(W_g^n [x_t^n; h_{t-1}^n])
c_t^n = f_t^n \odot c_{t-1}^n + i_t^n \odot g_t^n
o_t^n = \sigma(W_o^n [x_t^n; h_{t-1}^n])
h_t^n = o_t^n \odot \tanh(c_t^n)

where x_t^n encapsulates the input features to the n-th layer (when n = 1 there is no hidden value propagated from a lower layer, so x_t^1 is the input feature itself), h and c denote the hidden and cell states, the W are linear weight matrices (we omit the bias terms for brevity), and i, f, o, g are respectively the input, forget, output, and context gates, which have the same dimension as the hidden states. σ is the logistic sigmoid function and tanh is the hyperbolic tangent.
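A single step of the standard LSTM gate equations, written out in NumPy. Packing all four gate weight matrices into one matrix W over the concatenated input and previous hidden state is our own convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, D + H) and maps the concatenated
    [x; h_prev] to the stacked (i, f, o, g) pre-activations; b has shape (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # context (candidate cell) values
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```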

At each time step t, the distribution of the next primitive is predicted from y_t, which is obtained by a linear transformation of the concatenation of all the hidden layer values. This concatenation is similar in spirit to skip connections [38, 18], which are known to help training and mitigate the vanishing-gradient problem. In a similar fashion, we also pass the depth feature to all the hidden layers. We explain later how the primitive configuration is sampled from the distribution predicted from y_t.

We predict the parameters of one axis at a time, conditioned on the previous axis. We model the joint distribution of the shape and translation parameters (s_i, t_i) on each axis i (where i indicates one of the 3 axes of space) as a mixture of bivariate Gaussians, conditioned on the previous axis, with K mixture components:

\Pr(s_i, t_i \mid y) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(s_i, t_i \mid \mu_k, \sigma_k, \rho_k) \qquad (3)

where π_k, μ_k, σ_k, and ρ_k are the weight, mean, standard deviation, and correlation of each mixture component, respectively, predicted from a fully connected layer on top of y. A binary stopping sign e additionally indicates whether the current primitive is the final one, which allows predicting a variable-length sequence of primitives; K is fixed in our experiments. We randomly sample a single instance from the predicted distribution. The resulting sequence lists, in order, the shape and translation configuration on each axis of each primitive together with the stopping sign.
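Sampling one (shape, translation) pair from a predicted per-axis mixture can be sketched as follows (array shapes are our own convention):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdn(pi, mu, sigma, rho):
    """Draw one (shape, translation) pair from a K-component bivariate
    Gaussian mixture. pi: (K,), mu: (K, 2), sigma: (K, 2), rho: (K,)."""
    k = rng.choice(len(pi), p=pi)                    # pick a mixture component
    cov = np.array([
        [sigma[k, 0]**2,                     rho[k] * sigma[k, 0] * sigma[k, 1]],
        [rho[k] * sigma[k, 0] * sigma[k, 1], sigma[k, 1]**2],
    ])
    return rng.multivariate_normal(mu[k], cov)       # sample within component
```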

This is essentially a mixture density network (MDN) [4] on top of the LSTM output, with the loss defined as the negative log-likelihood over the sequence:

L_p = \sum_{t} -\log \sum_{k=1}^{K} \pi_k^t \, \mathcal{N}(s_{t+1}, t_{t+1} \mid \mu_k^t, \sigma_k^t, \rho_k^t)

The MDN is trained by maximizing the log-likelihood of the ground truth primitive parameters at each time step, where we calculate gradients explicitly for backpropagation as shown by Graves [15]. We found that this stepwise supervised training works well and avoids the sequential sampling used in [39, 11].
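The per-step mixture negative log-likelihood can be computed as below (the stopping-sign term is omitted; the density is the standard correlated bivariate Gaussian):

```python
import numpy as np

def mdn_nll(target, pi, mu, sigma, rho):
    """Negative log-likelihood of a 2-D target under a bivariate Gaussian
    mixture. pi: (K,), mu: (K, 2), sigma: (K, 2), rho: (K,)."""
    dx = (target[0] - mu[:, 0]) / sigma[:, 0]
    dy = (target[1] - mu[:, 1]) / sigma[:, 1]
    z = dx**2 + dy**2 - 2 * rho * dx * dy
    norm = 2 * np.pi * sigma[:, 0] * sigma[:, 1] * np.sqrt(1 - rho**2)
    comp = np.exp(-z / (2 * (1 - rho**2))) / norm    # per-component density
    return -np.log(np.sum(pi * comp) + 1e-12)        # epsilon for stability
```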

Geometric constraints. Another challenge in predicting primitive-based shape is modeling rotation, since the rotation axis is sensitive to slight changes in rotation values under Euler angles. We found that by jointly predicting the rotation axis and the rotation value, both the rotation prediction and the overall primitive distribution modeling improve, as shown in Fig. 6; quantitative experiments are given in Sec. 5.3. The rotation axis is predicted by a three-layer fully connected network with a sigmoid output, as shown in Fig. 5. The rotation value is predicted by a separate three-layer fully connected network with a tanh output.

4.2 Loss Function

The overall sequence loss of our network is:

L = L_p + L_r + L_a

where L_p is the mixture density loss defined in Sec. 4.1, L_r is a mean-squared error between the predicted and target rotation values, and L_a is the mean-squared error between the predicted and ground truth rotation axes.

Figure 6: Training performance comparison on the validation set of synthetic depth maps from ModelNet. Both the mixture density loss and the rotation MSE loss are averaged by sequence length, and the rotation values are normalized. Our mixture density estimation and rotation value estimation perform better when a loss on the predicted rotation axis is enforced.

Figure 7: Shape synthesis result. We show various random sampled shapes by our 3D-PRNN. The network is trained and tested without context input. The coloring indicates the prediction order.
Figure 8: Sample reconstructions from synthetic data from ShapeNet. We show the input depth map, the most probable shape reconstruction from 3D-PRNN, and three successive random sampling results, compared with our ground truth primitive representation.

Figure 9: Sample reconstructions from real depth maps in NYU Depth V2. We show the input depth map, the most probable shape reconstruction from 3D-PRNN, and two successive random sampling results, compared with our ground truth primitive representation.

5 Experiments and Discussions

We show qualitative results on automatic shape synthesis, and quantitatively evaluate 3D-PRNN in two tests: 1) 3D reconstruction from synthetic depth maps and 2) 3D reconstruction from real depth maps.

We train 3D-PRNN on ModelNet [43] categories: 889 chairs, 392 tables, and 200 nightstands. We use the additional 100 provided testing samples from each class for evaluation. We train a single network on all shape classes jointly. In all experiments, to avoid overfitting, we hold out a portion of the training samples to choose the number of training epochs, and then retrain the network on the entire training set. Since a single network is trained to encode all three classes, when predicting shape from depth images there is an implicit class prediction as well.

5.1 Implementation

We implement the 3D-PRNN network in Torch. We train the network on primitive-based shape configurations generated as described in Sec. 3. The parameters of each primitive (i.e., shape, translation, and rotation) are normalized to zero mean and unit standard deviation. We observe that the primitive order produced by the method of Sec. 3 involves too much randomness, which makes training hard. Instead, we pre-sort the primitives by the height of each primitive center in decreasing order; this simple sorting strategy significantly boosts training performance. Additionally, our network is trained on only one side of each symmetric shape to shorten the sequence length and speed up training. To train the generative mechanism, we use a simple random sampling technique. We use ADAM [23] to update the network parameters, and train with different batch sizes on the synthetic and real data.
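The pre-sorting step amounts to one stable sort by center height; a minimal sketch (field names and the choice of y as the height axis are ours):

```python
# Pre-sort primitives by decreasing height of each primitive's center.
prims = [
    {"name": "leg1", "center": (0.3, 0.1, 0.2)},
    {"name": "top",  "center": (0.0, 0.9, 0.0)},
    {"name": "leg2", "center": (-0.3, 0.1, 0.2)},
]
prims.sort(key=lambda p: p["center"][1], reverse=True)  # index 1 = height, assumed
print([p["name"] for p in prims])  # the table top comes first
```

Python's sort is stable, so primitives at equal height keep their original relative order.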

At test time, the network takes a single depth map and sequentially generates primitives until a stop sign is predicted. To initialize the first RNN feature, we perform a nearest-neighbor query based on the encoded feature of the depth map to retrieve the most similar shape in the training set, and use the configuration of its first primitive.

5.2 Shape Synthesis

3D-PRNN can be trained to generate new primitive-based shapes. Fig. 7 shows shapes randomly synthesized from all three shape classes. We initialize the first RNN feature with a randomly sampled primitive configuration from the training set. Since the first feature corresponds to the "width", "translation on the x-axis", and "rotation on the x-axis" of the primitive, this initialization amounts to drawing a sample from a discrete uniform distribution over these parameters, where the discrete samples are constructed from the training examples. The figure shows that 3D-PRNN can learn to generate representative samples from multiple classes and sometimes creates hybrids of multiple classes.

5.3 Shape Reconstruction from Single Depth View

Synthetic data. We project synthetic depth maps from the training meshes. For both training and testing, we rejection-sample 5 views per shape on a unit sphere, bounded within 20 degrees of the equator. The complete 3D shape is then predicted using a single depth map as input to 3D-PRNN. Our model can generate a sampling of complete shapes that match the input depth, as well as the most likely configuration, determined as the mean of the Gaussian of the most probable mixture component. We report 3D intersection over union (IoU) and surface-to-surface distance [31] between the most likely predicted shape and the ground truth mesh. To compute IoU, the ground truth mesh is voxelized to 30 x 30 x 30 resolution, and IoU is calculated based on whether the voxel centers fall inside the predicted primitives or not. Surface-to-surface distance is computed using 5,000 points sampled on the primitive and ground truth surfaces, and the distance is normalized by the diameter of a sphere tightly fit to the ground truth mesh (e.g., a distance of 0.05 corresponds to 5% of the object's maximum dimension).
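The voxel-center IoU described above can be sketched as follows; for brevity this version tests against axis-aligned (unrotated) cuboids:

```python
import numpy as np

def centers_in_cuboid(centers, S, t):
    """Boolean mask: which voxel centers fall inside an axis-aligned cuboid
    of scale S centered at t (rotation omitted for brevity)."""
    return np.all(np.abs(centers - t) <= S / 2.0, axis=1)

def voxel_iou(pred_inside, gt_inside):
    """IoU over voxel-center occupancy masks."""
    inter = np.logical_and(pred_inside, gt_inside).sum()
    union = np.logical_or(pred_inside, gt_inside).sum()
    return inter / union if union > 0 else 1.0

# Voxelize the unit cube at 30^3 resolution and score one cuboid vs. another.
g = (np.arange(30) + 0.5) / 30.0
centers = np.stack(np.meshgrid(g, g, g, indexing="ij"), axis=-1).reshape(-1, 3)
gt = centers_in_cuboid(centers, np.array([1.0, 0.5, 1.0]), np.array([0.5, 0.25, 0.5]))
pred = centers_in_cuboid(centers, np.array([1.0, 0.5, 1.0]), np.array([0.5, 0.35, 0.5]))
iou = voxel_iou(pred, gt)   # 12 shared y-slabs out of 18 -> 2/3
```

For a set of predicted primitives, the predicted mask would be the logical OR of the per-primitive masks.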

Tables 1 and 2 show our quantitative results. "GT prim" is the ground truth primitive representation generated by our parsing optimization method during training; it serves as an upper bound on our method's performance, corresponding to how well the primitive model can fit the true meshes. "NN Baseline" is the nearest-neighbor retrieval of a training-set shape based on the embedded depth feature from our network. By enforcing rotation axis constraints ("3D-PRNN + rot loss"), our 3D-PRNN achieves better performance, consistent with the learning curves shown in Fig. 6. Though both the nearest-neighbor baseline and 3D-PRNN are based on the trained encoding, 3D-PRNN outperforms the NN Baseline for table and nightstand, likely because it can generate a greater diversity of shapes from limited training data. We compare with the voxel-based reconstruction of Wu et al. [43], training and testing their method on the same data using publicly available code. Since Wu et al. generate randomized results, we report the average over ten runs. Our method performs similarly to Wu et al. [43] on the IoU measure. Wu et al. perform better on surface distance, which is less sensitive to alignment but more sensitive to structural details. The performance of our ground truth primitives confirms that much of our reduced performance in surface distance is due to using a coarser abstraction (which, though not preserving surface detail, has other benefits, as discussed in the introduction).

chair table night stand
GT prim 0.473 0.533 0.657
NN Baseline 0.269 0.220 0.256
Wu et al. [43] (mean) 0.253 0.250 0.295
3D-PRNN 0.245 0.188 0.204
3D-PRNN + rot loss 0.238 0.263 0.266
Table 1: Shape IoU evaluation on synthetic depth maps from ModelNet. We explore two settings of 3D-PRNN, with and without rotation axis constraints, and compare with the ground truth primitives, the nearest-neighbor baseline, and the deep network voxel generation method of Wu et al. [43].
chair table night stand
GT prim 0.049 0.044 0.044
NN baseline 0.075 0.089 0.100
Wu et al. [43] (mean) 0.045 0.035 0.057
3D-PRNN 0.074 0.080 0.104
3D-PRNN + rot loss 0.074 0.078 0.092
Table 2: Surface-to-surface distance evaluation on synthetic depth maps from ModelNet. We explore two settings of 3D-PRNN, with and without rotation axis constraints, and compare with the ground truth primitives and the nearest-neighbor baseline.

Real data (NYU Depth V2). We also test our model on the NYU Depth V2 dataset [36], which is much harder than the synthetic setting due to limited training data and because depth images of objects are low-resolution, noisy, and often occluded. We employ the ground truth data labelled by Guo and Hoiem [17], where 30 models were manually selected to represent 6 categories of common furniture: chair, table, desk, bed, bookshelf, and sofa. We fine-tune our network, previously trained on synthetic data, using the training set of NYU Depth V2. We report results on the test set using the same evaluation metrics as in the synthetic test, shown in Tables 3 and 4. Since nightstands are less common in the training set and occluded depth regions may resemble those of tables, the network often predicts table- or chair-shaped primitives for nightstands, resulting in worse performance for that class. Sample qualitative results are shown in Fig. 9.

3D Shape Segmentation. Since our primitive-based reconstructions naturally follow meaningful part configurations, our method can also be applied to shape segmentation. Please refer to our supplemental material for details and results of the shape segmentation task, where we also compare with state-of-the-art methods.

class chair table night stand
GT prim 0.037 0.048 0.020
NN baseline+ft 0.118 0.176 0.162
NN baseline 0.101 0.164 0.160
3D-PRNN+ft 0.112 0.168 0.192
3D-PRNN 0.110 0.181 0.194
Table 3: Surface-to-surface distance evaluation in real depth map in NYUd v2. We explore two settings of 3D-PRNN with (+ft) or without fine-tuning, and compare it with ground truth primitive and the nearest neighbor baseline.
class chair table night stand
GT prim 0.543 0.435 0.892
NN baseline +ft 0.171 0.078 0.286
NN baseline 0.145 0.076 0.262
3D-PRNN +ft 0.158 0.075 0.081
3D-PRNN 0.138 0.052 0.086
Table 4: Shape IoU evaluation in real depth map in NYUd v2. We explore two settings of 3D-PRNN with (+ft) or without fine-tuning, and compare it with ground truth primitive and the nearest neighbor baseline.

6 Conclusions and Future Work

We present 3D-PRNN, a generative recurrent neural network that uses recurring primitive-based abstractions for shape synthesis. 3D-PRNN models complex shapes with a low-dimensional parametric model, which has advantages such as the ability to model shapes when few training examples are available and when intra- and inter-class variance is large. Evaluations on synthetic and real depth map reconstruction tasks show that our method achieves results comparable to representations with a higher degree of freedom. Future explorations include allowing primitive configurations beyond cuboids (e.g., cylinders or spheres) and encoding explicit joints and spatial relationships between primitives.


This research is supported in part by NSF award 14-21521 and ONR MURI grant N00014-16-1-2007. We thank David Forsyth for insightful comments and discussion.


  • [1] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In IEEE CVPR, pages 3762–3769, 2014.
  • [2] P. J. Besl and N. D. McKay. A method for registration of 3-d shapes. IEEE PAMI, 14(2):239–256, 1992.
  • [3] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987.
  • [4] C. M. Bishop. Mixture density networks. 1994.
  • [5] A. Bódis-Szomorú, H. Riemenschneider, and L. Van Gool. Fast, approximate piecewise-planar modeling based on sparse structure-from-motion and superpixels. In IEEE CVPR, pages 469–476, 2014.
  • [6] F. Boughorbel, M. Mercimek, A. Koschan, and M. Abidi. A new method for the registration of three-dimensional point-sets: The gaussian fields framework. Image and Vision Computing, 28(1):124–137, 2010.
  • [7] A.-L. Chauve, P. Labatut, and J.-P. Pons. Robust piecewise-planar 3d reconstruction and completion from large-scale unstructured point data. In IEEE CVPR, pages 1261–1268, 2010.
  • [8] T. Chen, Z. Zhu, A. Shamir, S.-M. Hu, and D. Cohen-Or. 3-sweep: Extracting editable objects from a single photo. ACM TOG, 32(6):195, 2013.
  • [9] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, pages 628–644, 2016.
  • [10] S. J. Dickinson, A. Rosenfeld, and A. P. Pentland. Primitive-based shape modeling and recognition. In Visual Form, pages 213–229. 1992.
  • [11] S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pages 3225–3233, 2016.
  • [12] M. Firman, O. Mac Aodha, S. Julier, and G. J. Brostow. Structured prediction of unobserved voxels from a single depth image. In IEEE CVPR, pages 5431–5440, 2016.
  • [13] G. Gallego and A. Yezzi. A compact formula for the derivative of a 3-d rotation in exponential coordinates. Journal of Mathematical Imaging and Vision, 51(3):378–384, 2015.
  • [14] R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
  • [15] A. Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
  • [16] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 1462–1471, 2015.
  • [17] R. Guo and D. Hoiem. Support surface prediction in indoor scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2144–2151, 2013.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [19] T. Igarashi, S. Matsuoka, and H. Tanaka. Teddy: a sketching interface for 3d freeform design. In ACM SIGGRAPH ’99, pages 409–416.
  • [20] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems 29, pages 4996–5004. Curran Associates, Inc., 2016.
  • [21] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh. 3d object manipulation in a single photograph using stock 3d models. ACM TOG, 33(4):127, 2014.
  • [22] V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning part-based templates from large collections of 3d shapes. ACM TOG, 32(4):70, 2013.
  • [23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [24] C. Kurz, X. Wu, M. Wand, T. Thormählen, P. Kohli, and H.-P. Seidel. Symmetry-aware template deformation and fitting. In CGF, volume 33, pages 205–219, 2014.
  • [25] F. Lafarge and P. Alliez. Surface reconstruction through point set structuring. In Computer Graphics Forum, volume 32, pages 225–234, 2013.
  • [26] Y. Li, X. Wu, Y. Chrysathou, A. Sharf, D. Cohen-Or, and N. J. Mitra. Globfit: Consistently fitting primitives by discovering global relations. In ACM TOG, volume 30, page 52, 2011.
  • [27] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing ikea objects: Fine pose estimation. In IEEE CVPR, pages 2992–2999, 2013.
  • [28] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 30, 2013.
  • [29] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In IEEE ISMAR, pages 127–136, 2011.
  • [30] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML, 28:1310–1318, 2013.
  • [31] J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3d object shape from one depth image. In IEEE CVPR, pages 2484–2493, 2015.
  • [32] M. Schmidt. minfunc: unconstrained differentiable multivariate optimization in matlab. URL http://www.di.ens.fr/mschmidt/Software/minFunc.html, 2012.
  • [33] R. Schmidt, B. Wyvill, M. C. Sousa, and J. A. Jorge. Shapeshop: Sketch-based solid modeling with blobtrees. In ACM SIGGRAPH 2007 courses, page 43, 2007.
  • [34] R. Schnabel. Efficient point-cloud processing with primitive shapes. PhD thesis, University of Bonn, 2010.
  • [35] R. Schnabel, P. Degener, and R. Klein. Completion and reconstruction with primitive shapes. In CGF, volume 28, pages 503–512, 2009.
  • [36] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, pages 746–760, 2012.
  • [37] S. N. Sinha, D. Steedly, R. Szeliski, M. Agrawala, and M. Pollefeys. Interactive 3d architectural modeling from unordered photo collections. In ACM TOG, volume 27, page 159, 2008.
  • [38] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, pages 2377–2385, 2015.
  • [39] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. arXiv preprint arXiv:1612.00404, 2016.
  • [40] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, pages 1747–1756, 2016.
  • [41] Y. Wang, S. Asafi, O. van Kaick, H. Zhang, D. Cohen-Or, and B. Chen. Active co-analysis of a set of shapes. ACM Transactions on Graphics (TOG), 31(6):165, 2012.
  • [42] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d interpreter network. In ECCV, pages 365–382, 2016.
  • [43] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
  • [44] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, pages 1696–1704, 2016.
  • [45] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, A. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
  • [46] R. C. Zeleznik, K. P. Herndon, and J. F. Hughes. Sketch: An interface for sketching 3d scenes. In ACM SIGGRAPH 2007 courses, page 19. ACM, 2007.
  • [47] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.

Appendix A Sampling ratio in the primitive fitting energy

We define our primitive fitness energy in Sec. 3.1 as in Eq. 1, where the volumetric sampling ratio is defined as the volume of the primitive divided by its number of sampled points.


The product of the primitive's three scale parameters gives its volume. In our experiments, the number of sampled points is set to a predetermined constant.
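The sampling-ratio computation reduces to a one-liner; the cuboid scale triple and the name `sampling_ratio` are our assumptions, not the paper's notation:

```python
import numpy as np

def sampling_ratio(scale, num_points):
    """Volumetric sampling ratio: primitive volume over number of sampled points.

    scale: (sx, sy, sz) edge lengths of the cuboid primitive (hypothetical names);
    num_points: the predetermined constant number of sample points.
    """
    volume = float(np.prod(scale))  # sx * sy * sz is the cuboid volume
    return volume / num_points
```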

Appendix B Derivatives of primitive fitting energy

As stated in Sec. 3.1, Eq. 1 is differentiable and can be solved using robust techniques (L-BFGS [47]). Where the case condition of Eq. 1 holds, the derivatives are analytically defined; otherwise the energy is locally constant and the derivatives vanish. The derivatives of Eq. 1 with respect to the translation, scale, and rotation parameters in the primitive parameter set are:


where, based on Eq. 13, we have:


The rotation is represented in Euler angles; thus we take the derivative of the rotation matrix with respect to the rotation angles. Further details of the derivatives can be found in [13].
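As a toy illustration of the gradient-based solve, the sketch below fits only the translation of an axis-aligned cuboid by descending a Gaussian-field style energy. Plain gradient descent stands in for L-BFGS, rotation and scale are held fixed, and all hyperparameters are illustrative:

```python
import numpy as np

def fit_translation(points, scale, sigma=1.0, steps=100, lr=1.0):
    """Toy gradient-based fit of a cuboid's translation to a point set.

    Minimizes E(t) = -sum_{p,n} exp(-||prim_p + t - x_n||^2 / (2 sigma^2)),
    a Gaussian-field overlap between points sampled on the primitive and
    the target points, over the translation t only.
    """
    rng = np.random.default_rng(0)
    t = points.mean(axis=0) + 1.0  # deliberately biased initial guess
    # Points sampled uniformly inside the axis-aligned cuboid, centered at 0.
    prim = (rng.random((64, 3)) - 0.5) * np.asarray(scale)
    for _ in range(steps):
        diff = (prim + t)[:, None, :] - points[None, :, :]   # (P, N, 3)
        w = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))  # Gaussian field
        # dE/dt = sum w * diff / sigma^2 (normalized by pair count here).
        grad = (w[..., None] * diff).sum((0, 1)) / sigma ** 2
        t = t - lr * grad / (len(prim) * len(points))
    return t
```

In the paper the full parameter set (translation, scale, rotation) is optimized jointly with the analytic derivatives above.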

Appendix C Evaluation on primitive fitting

We evaluate our primitive fitting method on the public semantic region dataset of Yi et al. [45]. The dataset contains detailed per-point labeling of model parts in ShapeNetCore [43], obtained through an efficient semi-supervised labeling approach. We test our method on the test split of the chair category (704 shapes), which contains part labels for chair seat, back, arm, and leg.

We report results using the common metric of labeling accuracy at the triangle face level, which measures the fraction of faces with correct labels. We project the ground truth per-point part labels onto faces through nearest-neighbor search. Given our predicted primitive representation for the voxelized 3D shape, we randomly sample points on the shape (the number of sampled points is set based on the number of faces) and assign each point the segment label of the nearest predicted primitive. We then project our per-point segmentation onto mesh faces by majority vote: if multiple points with different labels correspond to the same face, the face receives the label suggested by the majority of the points. Since our primitive representation does not explicitly infer part labels, we automatically relabel each predicted segment based on the majority vote of the ground truth labels. We achieve an average face labeling accuracy of 0.843. Qualitative results are shown in Fig. 10. We observe that our primitive parsing method decomposes shapes into semantically meaningful parts. Lower accuracy is typically caused by: 1) a single primitive that fits the shape well but spans more than one semantic part (see the shape in the second row, first column); 2) errors on regions with a high concentration of faces, e.g., the cylindrical handles of the bottom-left shape contain more faces than its box-shaped seat; 3) failures of our method to parse out slim shape segments (see the bottom-right shape); 4) ground truth errors introduced when projecting per-point labels onto faces.
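The point-to-primitive labeling and per-face majority vote described above can be sketched as follows; assigning points to the nearest primitive center (rather than to the nearest primitive surface) is a simplification for brevity, and all names are ours:

```python
import numpy as np

def label_points_by_primitive(points, prim_centers, prim_labels):
    """Assign each sampled point the label of the nearest primitive center."""
    d = np.linalg.norm(points[:, None, :] - prim_centers[None, :, :], axis=-1)
    return prim_labels[d.argmin(axis=1)]

def face_labels_by_majority(point_labels, point_to_face, num_faces):
    """Majority vote of point labels per mesh face."""
    labels = np.zeros(num_faces, dtype=int)
    for f in range(num_faces):
        votes = point_labels[point_to_face == f]
        if votes.size:
            labels[f] = np.bincount(votes).argmax()
    return labels

def labeling_accuracy(pred, gt):
    """Fraction of faces with the correct label."""
    return float((pred == gt).mean())
```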

Figure 10: Sample shape semantic labeling results on the dataset of Yi et al. [45]. In each column, the leftmost shape shows the ground truth labeling of Yi et al., the middle shows our prediction, and the rightmost shows our primitive parsing result overlaid on the ground truth shape, with the face labeling accuracy indicated. Colors indicate chair back (green), chair seat (red), chair handle (dark green), and chair leg (yellow). The visualizations are based on face labeling.
Figure 11: Sample reconstructions from synthetic depth maps in ModelNet. We show the input depth map, the most probable shape reconstruction from 3D-PRNN, and three successive random sampling results, compared with our ground truth primitive representation. Each result is overlaid on the ground truth point cloud.
Figure 12: Sample reconstructions from real depth maps in NYUd v2. We show the input depth map, the most probable shape reconstruction from 3D-PRNN, and two successive random sampling results, compared with our ground truth primitive representation. Each result is overlaid on the ground truth point cloud.

Appendix D LSTMs sequential prediction order

The recurrent generator of 3D-PRNN described in Sec. 4.1 uses a pre-determined ordering of the parameter set. At each time step we sample a single instance drawn from the predicted mixture distribution. The sequence represents the parameters in the following order:

  • Time step 1: the shape (width) and translation on the x axis of the 1st primitive, and the stopping sign.

  • Time step 2: the shape (length) and translation on the y axis of the 1st primitive, and the stopping sign.

  • Time step 3: the shape (height) and translation on the z axis of the 1st primitive, and the stopping sign.

  • Time step 4: the shape (width) and translation on the x axis of the 2nd primitive, and the stopping sign.

  • Time step 5: the shape (length) and translation on the y axis of the 2nd primitive, and the stopping sign.

  • (sequential prediction continues in the same per-axis order for subsequent primitives)

  • Stop when "End of Generation" is predicted.

Note that the above sequence covers the primitive size and translation parameters. We simultaneously predict the rotation value and rotation axis: at time step 1 we predict the rotation value on the x axis and a binary signal indicating whether there is rotation on the x axis for the first primitive; time step 2 predicts the same for the y axis of the first primitive, and time step 3 for its z axis. This simultaneous prediction also stops when "End of Generation" is predicted by the main LSTM prediction sequence outlined above.
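The per-axis flattening of primitive parameters into a training sequence might look like the following sketch, where the dictionary layout and the stop-flag encoding are our assumptions:

```python
def primitives_to_sequence(prims):
    """Flatten primitives into the per-axis (scale, translation, stop) sequence.

    prims: list of dicts with 's' = (sx, sy, sz) and 't' = (tx, ty, tz).
    Returns one (scale, translation, stop) triple per time step; the stop
    flag is 1 only on the final step ("End of Generation").
    """
    seq = []
    for p in prims:
        for axis in range(3):  # x, y, z in order
            seq.append((p['s'][axis], p['t'][axis], 0))
    if seq:
        s, t, _ = seq[-1]
        seq[-1] = (s, t, 1)  # mark end of generation
    return seq
```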

Appendix E Sampling procedure

Because an unexpected sample far from the mean of the distribution causes accumulated error in the subsequent predictions, during testing we sample only from the two most probable mixture components, while in training we still perform random sampling over all mixture components. This strategy improves the stability of our network on synthetic data. In the real data case, we found that applying random sampling among all mixture components during both training and testing produces successive reasonable shapes. This is because the ground truth shapes in the real data have simple structures that are easier for the network to model.
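The test-time strategy of sampling from only the two most probable mixture components can be sketched as follows, for a 1-D mixture-density output; the function name and argument layout are ours:

```python
import numpy as np

def sample_top_k(pis, mus, sigmas, k=2, rng=None):
    """Sample from a 1-D Gaussian mixture restricted to its k most
    probable components (the test-time strategy described above).

    pis, mus, sigmas: per-component mixture weights, means, std-devs.
    """
    rng = rng or np.random.default_rng()
    top = np.argsort(pis)[-k:]         # indices of the k largest weights
    p = pis[top] / pis[top].sum()      # renormalize over the top-k
    c = rng.choice(top, p=p)           # pick one of the top components
    return rng.normal(mus[c], sigmas[c])
```

At training time one would instead sample over all components (`k = len(pis)`), matching the strategy described above.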

Appendix F Additional Results

f.1 Synthetic data

Additional qualitative results of shape reconstruction from a single depth view for synthetic data are shown in Fig. 11.

f.2 Real data

Additional qualitative results of shape reconstruction from a single depth view for real data are shown in Fig. 12.

Appendix G Application: shape segmentation

Our primitive-based reconstructions naturally align with semantic part configurations and are thus directly applicable to shape segmentation tasks. To demonstrate this, we assume an input 3D shape is fully observed and use 3D-PRNN to reconstruct it as a collection of primitives. We then use the resulting primitives to semantically segment the original input shape.

Volumetric Encoder. The input 3D shape is represented as a binary voxel grid. We revise our previous depth-image-based encoder network to handle such voxelized input. The first layer applies 3D convolution kernels followed by a LeakyReLU activation on the negative part. The second layer applies further convolutions, followed by the same LeakyReLU setting and a max pooling layer. The third layer applies convolutions followed by similar LeakyReLU and max pooling layers. Two fully-connected layers then produce the output feature vector, which is fed to the recurrent generator to predict a sequence of primitives. Note that the decoder and LSTM parts of the network remain the same.
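Since the exact kernel sizes and strides were not preserved in this text, the sketch below only illustrates the bookkeeping for tracing the cubic grid resolution through a conv/pool stack of the shape described above; the input side length, kernel/stride pairs, and pooling factor are placeholders, not the paper's values:

```python
def conv_out(n, kernel, stride, pad=0):
    """Spatial size after a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1

def encoder_feature_size(side=30, layers=((5, 2), (3, 1), (3, 1)), pool=2):
    """Trace the cubic grid side length through the encoder sketch.

    First layer: convolution only; subsequent layers: convolution
    followed by max pooling, mirroring the structure described above.
    """
    n = conv_out(side, *layers[0])       # first layer: conv only
    for k, s in layers[1:]:
        n = conv_out(n, k, s)            # conv
        n = conv_out(n, pool, pool)      # max pooling
    return n
```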

Evaluation. We evaluate the performance of 3D-PRNN on the shape segmentation task using the COSEG dataset [41]. Since there is no ground truth primitive representation for this dataset, for each shape we automatically extract the tightest oriented box around each labeled segment and use it as a ground truth primitive. Primitive ordering is pre-determined by decreasing height of each box center. Following the training scheme of the single-depth reconstruction case, we first train our network on a random split of the data and validate on a held-out split to choose the required number of training epochs. We then retrain on the combined training and validation data and test on the remaining data, which has never been seen by the network.
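Extracting a ground truth box per labeled segment can be sketched as below; we use tightest axis-aligned (rather than oriented) boxes and assume the z axis is the up direction, both simplifications of the procedure described above:

```python
import numpy as np

def tightest_boxes(points, labels):
    """Tightest axis-aligned box per labeled segment, ordered by
    decreasing box-center height (z assumed to be up).

    Returns a list of (label, center, size) tuples.
    """
    boxes = []
    for lab in np.unique(labels):
        seg = points[labels == lab]
        lo, hi = seg.min(axis=0), seg.max(axis=0)
        boxes.append((int(lab), (lo + hi) / 2, hi - lo))
    boxes.sort(key=lambda b: b[1][2], reverse=True)  # decreasing height
    return boxes
```

An oriented-box variant would additionally solve for a rotation per segment, e.g., via PCA of the segment's points.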

Since the largest class of objects in COSEG is the chair category with 400 shapes, we test on this category. However, this is still too small a set for training an RNN. Hence, we first pre-train our network on the ModelNet chair class with 889 shapes (with a validation split), then fine-tune on the COSEG chairs training set. This fine-tuning strategy increases our segmentation accuracy. We use ADAM [23] to update the network parameters, with separate batch sizes for training and fine-tuning.

Results. We report results using the same metric as in Sec. C. We compare the segmentation obtained from the most probable generation of 3D-PRNN with the template-based segmentation of Kim et al. [22], a state-of-the-art method that also fits oriented boxes to shapes for segmentation. Quantitative comparisons are provided in Table 5. Note that 3D-PRNN sometimes fails to predict some parts, which numerically lowers our performance; we therefore report both our overall performance and our average performance excluding such unpredicted parts. We also report the accuracy of the simple approach we used to generate the ground truth primitive representations for training, as this provides an upper bound for our method. In cases where 3D-PRNN predicts the correct number of primitives, it outperforms the method of Kim et al. [22].

method                                    accuracy
GT                                        0.896
Kim et al. [22]                           0.829
3D-PRNN                                   0.796
Kim et al. [22] (exc. unpredicted boxes)  0.836
3D-PRNN (exc. unpredicted boxes)          0.859
Table 5: Shape segmentation results on the chair category of the COSEG dataset.