1 Introduction
Many robotics and graphics applications require 3D interpretations of sensory data. For example, picking up a cup, moving a chair, predicting whether a stack of blocks will fall, or looking for keys on a messy desk all rely on at least a vague idea of object position, shape, contact, and connectedness. A major challenge is how to represent 3D object geometry in a way that (1) can be predicted from noisy or partial observations; and (2) is useful for reasoning about contact, support, extent, and so on. Recent efforts often focus on voxelized volumetric representations (e.g., [44, 43, 14, 9]). Instead, we propose to represent objects with 3D primitives (oriented 3D rectangles, i.e. cuboids). Compared to voxels, the primitives are much more compact: for example, 45D for 5 primitives parameterized by scale-rotation-translation vs. 32,768D for a 32x32x32 voxel grid. Also, primitives are holistic: representing an object with a few parts greatly simplifies reasoning about stability, connectedness, and other important properties. Primitive-based 3D object representations have long been popular in psychology (e.g. “geons” by Biederman [3]) and interactive graphics (e.g. “Teddy” [19]), but they are less commonly employed in modern computer vision due to the challenges of learning and predicting models that consist of an arbitrary number of parameterized components.
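The dimensionality comparison above can be checked with a few lines of arithmetic. The sketch below assumes 9 parameters per primitive (3 each for scale, rotation, and translation); the function names are illustrative only.

```python
# Dimensionality of a primitive-based vs. a voxel-based shape representation.
PARAMS_PER_PRIMITIVE = 3 + 3 + 3  # scale, rotation, translation (3 values each)


def primitive_dims(num_primitives: int) -> int:
    """Number of scalars needed to describe `num_primitives` cuboids."""
    return num_primitives * PARAMS_PER_PRIMITIVE


def voxel_dims(resolution: int) -> int:
    """Number of cells in a cubic occupancy grid of the given resolution."""
    return resolution ** 3


print(primitive_dims(5), voxel_dims(32))
```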
Our goal is to learn 3D primitive representations of objects from unannotated 3D meshes. We follow an encoder-decoder strategy, inspired by recent work [15, 40], using a recurrent neural network (RNN) to encode an implicit shape representation and then sequentially generate primitives to approximate the shape as shown in Fig. 1. One challenge in training such a primitive generation network is acquiring ground truth data for primitive-based shape representations. To address this challenge, we propose an efficient method based on Gaussian Fields and energy minimization [6] to iteratively parse shapes into primitive components. We optimize a differentiable loss function using robust techniques (L-BFGS [47]). We use this (unsupervised) optimization process to create the primitive ground truth, solving for a set of primitives that approximates each 3D mesh in a collection. The RNN can then be trained to generate new primitive-based shapes that are representative of an object class’ distribution or to complete an object’s shape given a partial observation such as a depth image or point cloud.
To model shapes, we propose 3D-PRNN, an RNN-based generative model that predicts context-sensitive sequences of primitives in object-centric coordinates, as shown in Fig. 2. To predict shape from depth, the network is trained jointly on a single depth image and the sequence of primitive configurations (shape, translation and rotation) that form the complete shape. During testing, the network takes a depth map as input and sequentially predicts primitives (ending with a stop signal) to reconstruct the shape. Our generative RNN architecture is based on a Long Short-Term Memory (LSTM) and a Mixture Density Network (MDN).
We evaluate our proposed generative model by comparing with baselines and state-of-the-art methods. We show that, even though our method has fewer degrees of freedom in its representation, it achieves accuracy comparable to voxel-based reconstruction methods. We also show that encoding further symmetry and rotation axis constraints in our network significantly boosts performance.
Our main contributions are:

We propose 3D-PRNN: a generative recurrent neural network that reconstructs 3D shapes as sequences of primitives given a single depth image.

We propose an efficient method to fit primitives to point clouds based on Gaussian fields and energy minimization. The resulting primitive representations provide enough training samples for 3D-PRNN in 3D reconstruction.
2 Related Work
Primitive-based Shape Modeling: Biederman, in the early 1980s, popularized the idea of representing shapes as a collection of components or primitives called “geons” [3], and early computer vision algorithms attempted to recover object-centered 3D volumetric primitives from single images [10]. In computer-aided design, primitive-based shape representations are used for 3D scene sketches [46, 33] and shape completion from point clouds [35, 26, 34]. When scans of shapes contain canonical parts such as planes or boxes and an efficient solution for large data is required, primitives are used to reconstruct urban and architectural scenes [7, 25, 5, 37]. Recently, more compact and parametric representations in the form of template objects [42] and sets of primitives [39] have been introduced. These representations, however, require non-trivial effort to accommodate a variable number of configurations within the object class they are trained for. This is mainly because of their single feed-forward design, which implicitly forces the prediction of a discrete number of variables at the same time.
Object 3D shape reconstruction can be attempted given an RGB image [44, 14, 1, 9] or a depth image [43, 31, 12]. Recently proposed representations and prediction approaches for 3D data from sensory input have mainly focused either on part- and object-based retrieval from large repositories [31, 1, 27, 21] or on voxelized volumetric representations [44, 43, 14, 9]. Better model fitting can be obtained by including part deformation [8] and symmetry [24]. Wu et al. [43] present preliminary results on automatic shape completion from depth by classifying hidden voxels with a deep network. Wu et al. [42] reconstruct shapes based on predicted skeletons. Unlike mesh-based or voxel-based shape reconstruction, our method predicts shapes as aggregations of primitives, which has the benefit of lower computational and storage cost.
Generative Models with RNNs: Graves [15] uses Long Short-Term Memory recurrent neural networks to generate complex sequences of text and online handwriting. Gregor et al. [16] combine LSTM and a variational autoencoder, called the Deep Recurrent Attentive Writer (DRAW) architecture, for image generation. The DRAW architecture is a pair of RNNs with an encoder network that compresses the real images presented during training, and a decoder that reconstitutes images after receiving codes. Rezende et al. [20] extend DRAW to learn generative models of 3D structures and recover this structure from 2D images via probabilistic inference. Our 3D-PRNN, which sequentially generates primitives, is inspired by Graves’ work on sequentially generating parameterized handwriting strokes and by the PixelRNN approach [40], which models natural images as sequentially generated pixels. To produce parameterized 3D primitives (oriented cuboids), we customize the RNN to encode explicit geometric constraints of symmetry and rotation. For example, separately predicting whether a primitive should rotate along each axis and by how much improves results over simply predicting rotation values, since many objects consist of several (unrotated) cuboids.
3 Fitting Primitives from Point Clouds
One challenge in training our 3D-PRNN primitive generation network is the lack of large-scale ground truth primitive-based shape reconstruction data. We propose an efficient method to generate such data. Given a point cloud representation of a shape, our approach finds the most plausible primitives to fit in a sequential manner; e.g., given a table, the algorithm might first identify the primitive that fits the top surface and then the legs successively. We use rectangular cuboids as primitives, which provide a plausible abstraction for most man-made objects. Our method provides a fast parsing solution to decompose shapes of varying complexity into a set of such primitives.
3.1 Primitive Fitness Energy
We formulate the successive fitting of primitives as an energy minimization scheme. While primitive fitting at each step resembles the method of Iterative Closest Point (ICP) [2], we have additional challenges. ICP ensures accurate registration when provided with a good initialization, but in our case we have no prior knowledge about the number and the rough shape of the primitives. Moreover, we need to solve the more challenging partial matching problem since each primitive matches only part of the shape, which we do not know in advance.
We represent the shape of each primitive with scale parameters , which denotes the scale of a unit cube along three orthogonal axes. The position and orientation of the primitive are represented by translation, , and Euler angles, , respectively. Thus the primitive is parameterized by . Furthermore, we assume a fixed sampling of the unit cube into a set of points, . Given a point cloud representation of a shape, , our goal is to find the set of primitives that best fit the shape. We employ the idea of Gaussian Force Fields [6] and Truncated Signed Distance Function (TSDF) [29] to formulate the following continuously differentiable energy function which is convex in a large neighborhood of the parameters:
(1) 
where is the rotation matrix, is the truncation parameter ( in our experiments) and denotes the volumetric-wise sampling ratio that is calculated as the volume of primitive over its number of sampled points . helps avoid local minima that result in a too small or too large primitive. Our formulation represents the error as a smooth sum of Gaussian kernels, where far away point pairs are penalized less to account for partial matching.
The energy function given in Eq. 1 is sensitive to the parameter . A larger will encourage fitting of large primitives while allowing larger distances between matched point pairs. In order to prefer tighter fitting primitives, we introduce the concept of negative shape, , which is represented as a set of points sampled in the non-occupied space inside the bounding box of a shape. We update our energy function as:
(2) 
where is the fitting energy between the shape and the primitive and is the fitting energy between the negative shape and the primitive. Given point samples, both and are computed as in Eq. 1. denotes the relative weighting of these two terms and is defined as .
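To make the description of the fitting energy concrete, the following is a minimal sketch of an energy with the properties stated above: a sum of Gaussian kernels over truncated point-pair distances, weighted by the primitive volume per sample, plus a negative-shape term. It is not a transcription of Eqs. 1 and 2. The sign convention, the Euler-angle order, the way the two terms are combined, and all names and default values (`sigma`, `tau`, `alpha`, `fitting_energy`, `total_energy`) are assumptions made for illustration.

```python
import math


def rotation_matrix(theta):
    """R = Rz @ Ry @ Rx from Euler angles in radians (order is an assumption)."""
    ax, ay, az = theta
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def transform(p, scale, theta, trans):
    """Map a unit-cube sample p to world space: R (S * p) + T."""
    R = rotation_matrix(theta)
    s = [scale[i] * p[i] for i in range(3)]
    return [sum(R[r][c] * s[c] for c in range(3)) + trans[r] for r in range(3)]


def unit_cube_samples(k=3):
    """A fixed k x k x k grid of samples in the cube [-0.5, 0.5]^3."""
    ticks = [i / (k - 1) - 0.5 for i in range(k)]
    return [(x, y, z) for x in ticks for y in ticks for z in ticks]


def fitting_energy(scale, theta, trans, cube_pts, shape_pts, sigma=0.5, tau=2.0):
    """Negative sum of Gaussian kernels over truncated point-pair distances."""
    volume = scale[0] * scale[1] * scale[2]
    ratio = volume / len(cube_pts)  # primitive volume per sampled point
    total = 0.0
    for p in cube_pts:
        x = transform(p, scale, theta, trans)
        for q in shape_pts:
            d = min(tau, math.dist(x, q))  # truncate far-away pairs
            total += ratio * math.exp(-(d * d) / (sigma * sigma))
    return -total


def total_energy(prim, cube_pts, shape_pts, negative_pts, alpha=1.0):
    """Reward matches to the shape, penalize matches to the negative shape."""
    scale, theta, trans = prim
    return (fitting_energy(scale, theta, trans, cube_pts, shape_pts)
            - alpha * fitting_energy(scale, theta, trans, cube_pts, negative_pts))


cube = unit_cube_samples(3)
shape = [transform(p, (1, 1, 1), (0, 0, 0), (0, 0, 0)) for p in cube]
aligned = fitting_energy((1, 1, 1), (0, 0, 0), (0, 0, 0), cube, shape)
shifted = fitting_energy((1, 1, 1), (0, 0, 0), (5, 0, 0), cube, shape)
print(aligned < shifted)
```

Under these assumptions a primitive that overlaps the shape receives a lower energy than one placed far away, and overlap with negative-shape points raises the energy, which matches the preference for tight fits described above.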
3.2 Optimization
Given the energy formulation described in the previous section, we perform primitive fitting in a sequential manner. During each iteration, we randomly initialize primitives, optimize Eq. 2 for each of these primitives and add the best fitting primitive to our primitive collection. We then remove the points in that are fit by the selected primitive and iterate. We stop once all the points in are fit by a primitive. We optimize Eq. 2 in an iterative manner. We first fix and solve for and ; we then fix and and solve for . In our experiments this optimization converges in iterations and we use the L-BFGS toolbox [32] at each optimization step. We summarize this process with the pseudocode given in Alg. 1.
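The control flow of the sequential fitting described above can be sketched as a greedy loop. This is only an illustration of the propose, optimize, select, and remove cycle: the callables `propose`, `optimize`, `energy`, and `covers` are hypothetical placeholders (in the method above, `optimize` would be the alternating L-BFGS minimization of Eq. 2), and the toy stand-ins below work on 1D intervals rather than cuboids.

```python
import random


def fit_sequentially(points, propose, optimize, energy, covers,
                     n_init=10, max_primitives=20, seed=0):
    """Greedy loop: fit one primitive at a time and drop the points it explains."""
    rng = random.Random(seed)
    remaining = list(points)
    primitives = []
    while remaining and len(primitives) < max_primitives:
        # Several random initializations, each refined by the optimizer.
        candidates = [optimize(propose(remaining, rng), remaining)
                      for _ in range(n_init)]
        best = min(candidates, key=lambda c: energy(c, remaining))
        left = [p for p in remaining if not covers(best, p)]
        if len(left) == len(remaining):
            break  # the best candidate explains nothing new; stop
        primitives.append(best)
        remaining = left
    return primitives, remaining


# Toy 1D stand-ins (hypothetical): a "primitive" is an interval (lo, hi).
def propose(points, rng):
    c = rng.choice(points)
    return (c - 0.5, c + 0.5)


def covers(interval, p):
    return interval[0] <= p <= interval[1]


def optimize(interval, points):
    inside = [p for p in points if covers(interval, p)]
    return (min(inside), max(inside))  # shrink to the covered points


def energy(interval, points):
    return -sum(covers(interval, p) for p in points)


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
primitives, unexplained = fit_sequentially(points, propose, optimize, energy, covers)
print(primitives, unexplained)
```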
Simplification with symmetry. We utilize the symmetry characteristics of man-made shapes to further speed up the primitive parsing procedure. We use axis-aligned 3D objects where symmetric objects have a common global symmetry plane. We compare the geometry on the two sides of this plane to decide whether an object is symmetric or not. Once we obtain a primitive that lies on one side of the symmetry plane, we automatically generate the symmetric primitive on the other side of the plane.
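Generating the mirrored primitive amounts to reflecting its parameters across the symmetry plane. The sketch below assumes the plane is x = 0 and that orientation is composed of per-axis rotations about x, y, and z. Under that assumption, a reflection leaves the scale and the x-rotation unchanged, negates the y- and z-rotations, and negates the x-translation. The function name is illustrative.

```python
def mirror_primitive(scale, theta, trans):
    """Reflect a cuboid (scale, Euler angles, translation) across the plane x = 0."""
    sx, sy, sz = scale
    ax, ay, az = theta
    tx, ty, tz = trans
    # Conjugating by diag(-1, 1, 1) keeps Rx(a) and maps Ry(b) -> Ry(-b), Rz(c) -> Rz(-c).
    return (sx, sy, sz), (ax, -ay, -az), (-tx, ty, tz)


left_leg = ((0.1, 0.1, 0.5), (0.0, 0.2, 0.3), (0.4, 0.3, -0.25))
right_leg = mirror_primitive(*left_leg)
print(right_leg)
```

Applying the reflection twice returns the original primitive, which is a convenient sanity check.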
Refinement. At each step, we fit primitives with a relatively larger Gaussian field ( in Eq. 1) for fast convergence and easier optimization. We then refine the fitting with a finer energy space () to match the primitive to the detailed shape of the object. While our random search scheme enables a fast parsing method, errors may accumulate in the final set of primitives. To avoid such problems, we perform a post-refinement step. We refine the parameters of a single primitive while fixing the other parameters. We use the parameters of obtained from the initial fitting as initialization. We define the energy terms in Eq. 2 with respect to the points that are fit by and the points that are not fit by any primitive yet. We note that this sequential refinement is similar to backpropagation used to train neural networks. In our experiments, we perform the refinement each time we fit 3 new primitives.
4 3D-PRNN: 3D Primitive Recurrent Neural Networks
Generating primitive-based 3D shapes is a challenging task due to the complex multi-modal distribution of shapes and the unconstrained number of primitives required to model such complex shapes. We propose 3D-PRNN, a generative recurrent neural network to accomplish this task. 3D-PRNN can be trained to generate novel shapes both randomly and by conditioning on partial shape observations such as a single depth map.
4.1 Network Architecture
An overview of the 3D-PRNN network is illustrated in Fig. 2. The network takes as input a single depth image and sequentially predicts primitives to form a 3D shape. For each primitive, the network predicts its shape (height, length, width), position (i.e. translation), and orientation (i.e. rotation). Additionally, at each step, a binary end-of-generation signal is predicted, indicating that no more primitives should be generated.
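The sequential prediction with an end-of-generation signal can be sketched as a simple loop. The `step` callable below is a hypothetical stand-in for one pass through the recurrent generator; its signature, the `max_steps` safety cap, and the toy implementation are assumptions for illustration, not the network's interface.

```python
def generate_primitives(step, depth_feature, first_input, max_steps=50):
    """Call `step` repeatedly until it raises the stop flag; return all primitives."""
    primitives, state, prev = [], None, first_input
    for _ in range(max_steps):
        primitive, stop, state = step(depth_feature, prev, state)
        primitives.append(primitive)
        if stop:
            break
        prev = primitive  # feed the prediction back in as the next input
    return primitives


def toy_step(depth_feature, prev, state):
    """Hypothetical generator: stacks three unit cubes, then signals stop."""
    n = (state or 0) + 1
    primitive = {
        "shape": (1.0, 1.0, 1.0),
        "translation": (0.0, 0.0, float(n)),
        "rotation": (0.0, 0.0, 0.0),
    }
    return primitive, n >= 3, n


shape = generate_primitives(toy_step, depth_feature=None, first_input=None)
print(len(shape))
```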
Depth map encoder. Each input depth map, , is first resized to be in dimension with values in the range (we set the value of background regions to 0). is passed to an encoder which consists of stacks of convolutional and LeakyReLU [28] layers as shown in Fig. 5 (a): the first layer has kernels of size and stride , with a LeakyReLU layer of slope in the negative part. The second layer consists of kernels of size (stride), followed by the same setting of LeakyReLU and a max pooling layer. The third layer has kernels of size (stride) followed by LeakyReLU and max pooling. The next two fully connected layers have neurons of and . The output feature vector is then sent to the recurrent generator to predict a sequence of primitives.
Recurrent generator. We apply the Long Short-Term Memory (LSTM) unit inside the recurrent generator, which has been shown to be better at alleviating the vanishing or exploding gradient problems [30] when training RNNs. The architectural design is shown in Fig. 5 (b). The prediction unit consists of layers of recurrently connected hidden layers (we set , which is found to be sufficient to model the complex primitive distributions) that encode both the depth feature and the previously predicted primitive and then compute the output vector, . is used to parametrize a predictive distribution over the next possible primitive . The hidden layer activations are computed by iterating over the following equations in the range and :
(3)  
(4)  
(5)  
(6)  
(7) 
where encapsulates the input features in the th layer (when , there is no hidden value propagated from the previous layers and thus ), and denote the hidden and cell states, denote the linear weight matrices (we omit the bias terms for brevity), and , , , are respectively the input, forget, output, and context gates, which have the same dimension as the hidden states (size of ). is the logistic sigmoid function and is the hyperbolic tangent function.
At each time step , the distribution of the next primitive is predicted as , where we perform a linear transformation on the concatenation of all the hidden values. This concatenation is similar in spirit to using skip connections [38, 18], which have been shown to help training and mitigate the ‘vanishing gradient’ problem. In a similar fashion, we also pass the depth feature to all the hidden layers. We will explain later how the primitive configuration is sampled from a distribution predicted from .
We predict the parameters of one axis at a time, conditioned on the previous axis. We model the joint distribution of the parameters on each axis (where indicates one of the 3 axes of space) as a mixture of Gaussians conditioned on the previous axis with mixture components:
(8) 
where , , and are the weight, mean, standard deviation, and correlation of each mixture component respectively, predicted from a fully connected layer . Note that is the binary stopping sign indicating whether the current primitive is the final one; it helps with predicting a variable-length sequence of primitives. In our experiments we set . We randomly sample a single instance drawn from the distribution . The sequence represents the parameters in the following order: for the shape translation configuration on axis of the first primitive and the stopping sign.
This is essentially a mixture density network (MDN) [4] on top of the LSTM output and its loss is defined:
(9) 
The MDN is trained by maximizing the log likelihood of the ground truth primitive parameters at each time step, where we calculate gradients explicitly for backpropagation as shown by Graves [15]. We found that this step-wise supervised training works well and avoids the sequential sampling used in [39, 11].
Geometric constraints. Another challenge in predicting primitive-based shapes is modeling rotation, given that the rotation axis is sensitive to slight changes in rotation values under Euler angles. We found that jointly predicting the rotation axis and the rotation value improves both the rotation prediction and the overall modeling of the primitive distribution, as shown in Fig. 6; quantitative experiments are in Sec. 5.3. The rotation axis () is predicted by a three-layer fully connected network with size , and and a sigmoid function, as shown in Fig. 5. The rotation value () is predicted by a separate three-layer fully connected network with size , and and a function.
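As an illustration of the mixture density output discussed above, the sketch below evaluates the negative log-likelihood of a target under a Gaussian mixture, draws a sample, and picks the mean of the most probable component. For brevity it uses a univariate mixture, which is a simplification: the model above also predicts a correlation term per component. All function names are illustrative.

```python
import math
import random


def mixture_nll(x, weights, means, stds):
    """Negative log-likelihood of x under a 1D Gaussian mixture."""
    density = sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
    return -math.log(density)


def sample_mixture(weights, means, stds, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])


def most_likely_mean(weights, means):
    """Mean of the component with the largest mixture weight."""
    k = max(range(len(weights)), key=lambda i: weights[i])
    return means[k]


rng = random.Random(0)
weights, means, stds = [0.2, 0.8], [1.0, -3.0], [0.5, 0.1]
print(mixture_nll(-3.0, weights, means, stds) < mixture_nll(4.0, weights, means, stds))
print(sample_mixture(weights, means, stds, rng), most_likely_mean(weights, means))
```

Training minimizes the first quantity on ground truth parameters; sampling gives diverse shapes, while the most probable component's mean gives a single deterministic prediction.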
4.2 Loss Function
The overall sequence loss of our network is:
(10)  
(11)  
(12) 
is defined in Eq. 9. is a mean square loss between predicted, , and target, , rotation. is the mean square loss between the predicted, , and ground truth, , rotation axis.
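A per-step loss of the kind described above can be sketched as the MDN negative log-likelihood plus two mean square terms. The unweighted sum below is an assumption, since the relative weighting of the three terms is not recoverable from the text here; `mse` and `step_loss` are illustrative names.

```python
def mse(pred, target):
    """Mean square error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def step_loss(nll, rot_pred, rot_target, axis_pred, axis_target):
    """MDN negative log-likelihood plus rotation-value and rotation-axis MSE."""
    return nll + mse(rot_pred, rot_target) + mse(axis_pred, axis_target)


print(step_loss(1.0, [0.0], [1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
```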
5 Experiments and Discussions
We show qualitative results on automatic shape synthesis. We quantitatively evaluate 3D-PRNN in two tests: 1) 3D reconstruction from synthetic depth maps and 2) 3D reconstruction from real depth maps.
We train 3D-PRNN on three ModelNet [43] categories: 889 chairs, 392 tables and 200 nightstands. We use the additional 100 test samples provided for each class for evaluation. We train a single network on all shape classes jointly. In all experiments, to avoid overfitting, we hold out of the training samples, which are then used to choose the number of training epochs. We then retrain the network using the entire training set. Since a single network is trained to encode all three classes, when predicting shape from depth images, for example, there is an implicit class prediction as well.
5.1 Implementation
We implement the 3D-PRNN network using Torch. We train our network on primitive-based shape configurations generated as described in Sec. 3. The parameters of each primitive (i.e. shape, translation and rotation) are normalized to have zero mean and standard deviation. We observe that the order of the primitives generated by the method described in Sec. 3 involves too much randomness, which makes training hard. Instead, we pre-sort the primitives by the height of each primitive's center in decreasing order. This simple sorting strategy significantly boosts the training performance. Additionally, our network is trained on only one side of symmetric shapes to shorten the sequence length and speed up training. To train with the generative mechanism, we use a simple random sampling technique. We use ADAM [23] to update network parameters with a learning rate of , , and . We train the network with batch size and on the synthetic data and on the real data respectively.
At test time, the network takes a single depth map and sequentially generates primitives until a stop sign is predicted. To initialize the first RNN feature , we perform a nearest neighbor query based on the encoded feature of the depth map to retrieve the most similar shape in the training set and use the configuration of its first primitive.
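The two preprocessing steps above, sorting primitives by center height and normalizing parameters, can be sketched as follows. The sketch assumes that the vertical axis is the third translation coordinate and that normalization targets unit standard deviation; neither is stated explicitly in the text, and the dictionary layout is illustrative.

```python
from statistics import mean, pstdev


def sort_by_height(primitives, up_axis=2):
    """Order primitives from highest to lowest center along the assumed up axis."""
    return sorted(primitives, key=lambda p: p["translation"][up_axis], reverse=True)


def normalize(values):
    """Shift to zero mean and scale to unit standard deviation (assumed target)."""
    mu = mean(values)
    sd = pstdev(values) or 1.0  # avoid dividing by zero for constant inputs
    return [(v - mu) / sd for v in values]


table = [
    {"name": "leg", "translation": (0.4, 0.4, 0.25)},
    {"name": "top", "translation": (0.0, 0.0, 0.55)},
    {"name": "shelf", "translation": (0.0, 0.0, 0.30)},
]
print([p["name"] for p in sort_by_height(table)])
print(normalize([1.0, 2.0, 3.0]))
```

A fixed ordering of this kind gives the recurrent generator a consistent target sequence for shapes of the same class.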
5.2 Shape Synthesis
3D-PRNN can be trained to generate new primitive-based shapes. Fig. 7 shows our randomly generated shapes synthesized from all three shape classes. We initialize the first RNN feature with a randomly sampled primitive configuration from the training set. Since the first feature corresponds to “width”, “translation in x-axis”, and “rotation on x-axis” of the primitive, formally this initialization process is defined as drawing a sample from a discrete uniform distribution of these parameters where the discrete samples are constructed from the training examples. The figure shows that 3D-PRNN can learn to generate representative samples from multiple classes and sometimes creates hybrids from multiple classes.
5.3 Shape Reconstruction from Single Depth View
Synthetic data. We project synthetic depth maps from training meshes. For both training and testing, we perform rejection sampling on a unit sphere for 5 views, bounded within 20 degrees of the equator. The complete 3D shape is then predicted using a single depth map as input to 3D-PRNN. Our model can generate a sampling of complete shapes that match the input depth, as well as the most likely configuration, determined as the mean of the Gaussian from the most probable mixture. We report 3D intersection over union (IoU) and surface-to-surface distance [31] of the most likely predicted shape to the ground truth mesh. To compute IoU, the ground truth mesh is voxelized to 30 x 30 x 30 resolution, and IoU is calculated based on whether the voxel centers fall inside the predicted primitives or not. Surface-to-surface distance is computed using 5,000 points sampled on the primitive and ground truth surfaces, and the distance is normalized by the diameter of a sphere tightly fit to the ground truth mesh (e.g. 0.05 is 5% of the object's maximum dimension).
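The voxel-center IoU described above can be sketched as follows. For brevity the sketch tests voxel centers against axis-aligned boxes (rotation omitted), represents the ground truth as a set of occupied voxel indices, and uses a small grid in the example; these are simplifications, and the function names are illustrative.

```python
def voxel_centers(res, lo=-0.5, hi=0.5):
    """Centers of a res x res x res grid spanning [lo, hi]^3."""
    step = (hi - lo) / res
    ticks = [lo + (i + 0.5) * step for i in range(res)]
    return [(x, y, z) for x in ticks for y in ticks for z in ticks]


def inside_box(p, center, size):
    """True if point p lies inside an axis-aligned box."""
    return all(abs(p[i] - center[i]) <= size[i] / 2 for i in range(3))


def voxel_iou(gt_occupied, boxes, centers):
    """IoU between GT voxels (index set) and voxels whose centers fall in any box."""
    pred = {i for i, c in enumerate(centers)
            if any(inside_box(c, center, size) for center, size in boxes)}
    union = pred | gt_occupied
    return len(pred & gt_occupied) / len(union) if union else 1.0


centers = voxel_centers(4)
lower_half = {i for i, c in enumerate(centers) if c[2] < 0}
whole_cube = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))]
print(voxel_iou(lower_half, whole_cube, centers))
```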
Tables 1 and 2 show our quantitative results. “GT prim” is the ground truth primitive representation generated by our parsing optimization method during training. This serves as an upper bound on the performance of our method, corresponding to how well the primitive model can fit the true meshes. “NN Baseline” is the nearest neighbor retrieval of a shape in the training set based on the embedded depth feature from our network. By enforcing rotation axis constraints (“3D-PRNN + rot loss”), our 3D-PRNN achieves better performance, which is consistent with the learning curve shown in Fig. 6. Though both nearest neighbor and 3D-PRNN are based on the trained encoding, 3D-PRNN outperforms NN Baseline for table and nightstand, likely because it is able to generate a greater diversity of shapes from limited training data. We compare with the voxel-based reconstruction of Wu et al. [43], training and testing their method on the same data using publicly available code. Since Wu et al. generate randomized results, we measure the average result over ten runs. Our method performs similarly to Wu et al. [43] on the IoU measure. Wu et al. perform better on surface distance, which is less sensitive to alignment but more sensitive to details in structures. The performance of our ground truth primitives confirms that much of our reduced performance in surface distance is due to using a coarser abstraction (which, though not preserving surface detail, has other benefits, as discussed in the introduction).
Table 1: 3D IoU on synthetic depth maps (higher is better).
class                  chair  table  nightstand
GT prim                0.473  0.533  0.657
NN Baseline            0.269  0.220  0.256
Wu et al. [43] (mean)  0.253  0.250  0.295
3D-PRNN                0.245  0.188  0.204
3D-PRNN + rot loss     0.238  0.263  0.266
Table 2: Surface-to-surface distance on synthetic depth maps (lower is better).
class                  chair  table  nightstand
GT prim                0.049  0.044  0.044
NN Baseline            0.075  0.089  0.100
Wu et al. [43] (mean)  0.045  0.035  0.057
3D-PRNN                0.074  0.080  0.104
3D-PRNN + rot loss     0.074  0.078  0.092
Real data (NYU Depth V2). We also test our model on the NYU Depth V2 dataset [36], which is much harder than the synthetic data due to limited training data and the fact that depth images of objects are lower resolution, noisy, and often occluded. We employ the ground truth data labelled by Guo and Hoiem [17], where 30 models are manually selected to represent 6 categories of common furniture: chair, table, desk, bed, bookshelf and sofa. We fine-tune our network that was trained on synthetic data using the training set of NYU Depth V2. We report results on the test set using the same evaluation metrics as in the synthetic test, shown in Tables 3 and 4. Since nightstand is less common in the training set and often occluded depth regions may be similar to those for tables, the network often predicts primitives in the shapes of tables or chairs for nightstands, resulting in worse performance for that class. Sample qualitative results are shown in Fig. 9.
3D Shape Segmentation. Since our primitive-based reconstructions naturally follow meaningful part configurations, another application of our method is shape segmentation. Please refer to our supplemental material for shape segmentation task details and results, where we also compare with state-of-the-art methods.
Table 3: Surface-to-surface distance on real depth maps from NYU Depth V2 (lower is better).
class              chair  table  nightstand
GT prim            0.037  0.048  0.020
NN Baseline + ft   0.118  0.176  0.162
NN Baseline        0.101  0.164  0.160
3D-PRNN + ft       0.112  0.168  0.192
3D-PRNN            0.110  0.181  0.194
Table 4: 3D IoU on real depth maps from NYU Depth V2 (higher is better).
class              chair  table  nightstand
GT prim            0.543  0.435  0.892
NN Baseline + ft   0.171  0.078  0.286
NN Baseline        0.145  0.076  0.262
3D-PRNN + ft       0.158  0.075  0.081
3D-PRNN            0.138  0.052  0.086
6 Conclusions and Future Work
We present 3D-PRNN, a generative recurrent neural network that uses recurring primitive-based abstractions for shape synthesis. 3D-PRNN models complex shapes with a low-parametric model, which has advantages such as being able to model shapes with fewer training examples available and with large intra- and inter-class variance. Evaluations on synthetic and real depth map reconstruction tasks show that results comparable to those of higher degree-of-freedom representations can be achieved with our method. Future explorations include allowing primitive types beyond cuboids (e.g., cylinders or spheres) and encoding explicit joints and spatial relationships between primitives.
Acknowledgements
This research is supported in part by NSF award 1421521 and ONR MURI grant N000141612007. We thank David Forsyth for insightful comments and discussion.
References
 [1] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In IEEE CVPR, pages 3762–3769, 2014.
 [2] P. J. Besl and N. D. McKay. A method for registration of 3d shapes. IEEE PAMI, 14(2):239–256, 1992.
 [3] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987.
 [4] C. M. Bishop. Mixture density networks. 1994.
 [5] A. Bódis-Szomorú, H. Riemenschneider, and L. Van Gool. Fast, approximate piecewise-planar modeling based on sparse structure-from-motion and superpixels. In IEEE CVPR, pages 469–476, 2014.
 [6] F. Boughorbel, M. Mercimek, A. Koschan, and M. Abidi. A new method for the registration of three-dimensional point-sets: The gaussian fields framework. Image and Vision Computing, 28(1):124–137, 2010.
 [7] A.-L. Chauve, P. Labatut, and J.-P. Pons. Robust piecewise-planar 3d reconstruction and completion from large-scale unstructured point data. In IEEE CVPR, pages 1261–1268, 2010.
 [8] T. Chen, Z. Zhu, A. Shamir, S.-M. Hu, and D. Cohen-Or. 3-sweep: Extracting editable objects from a single photo. ACM TOG, 32(6):195, 2013.
 [9] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, pages 628–644, 2016.
 [10] S. J. Dickinson, A. Rosenfeld, and A. P. Pentland. Primitive-based shape modeling and recognition. In Visual Form, pages 213–229. 1992.

 [11] S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pages 3225–3233, 2016.
 [12] M. Firman, O. Mac Aodha, S. Julier, and G. J. Brostow. Structured prediction of unobserved voxels from a single depth image. In IEEE CVPR, pages 5431–5440, 2016.
 [13] G. Gallego and A. Yezzi. A compact formula for the derivative of a 3d rotation in exponential coordinates. Journal of Mathematical Imaging and Vision, 51(3):378–384, 2015.
 [14] R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
 [15] A. Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.

 [16] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 1462–1471, 2015.
 [17] R. Guo and D. Hoiem. Support surface prediction in indoor scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2144–2151, 2013.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [19] T. Igarashi, S. Matsuoka, and H. Tanaka. Teddy: a sketching interface for 3d freeform design. In ACM SIGGRAPH ’99, pages 409–416.
 [20] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems 29, pages 4996–5004. Curran Associates, Inc., 2016.
 [21] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh. 3d object manipulation in a single photograph using stock 3d models. ACM TOG, 33(4):127, 2014.
 [22] V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning part-based templates from large collections of 3d shapes. ACM TOG, 32(4):70, 2013.
 [23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [24] C. Kurz, X. Wu, M. Wand, T. Thormählen, P. Kohli, and H.-P. Seidel. Symmetry-aware template deformation and fitting. In CGF, volume 33, pages 205–219, 2014.
 [25] F. Lafarge and P. Alliez. Surface reconstruction through point set structuring. In Computer Graphics Forum, volume 32, pages 225–234, 2013.
 [26] Y. Li, X. Wu, Y. Chrysanthou, A. Sharf, D. Cohen-Or, and N. J. Mitra. Globfit: Consistently fitting primitives by discovering global relations. In ACM TOG, volume 30, page 52, 2011.
 [27] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing ikea objects: Fine pose estimation. In IEEE CVPR, pages 2992–2999, 2013.
 [28] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 30, 2013.
 [29] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE ISMAR, pages 127–136, 2011.
 [30] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML, 28:1310–1318, 2013.
 [31] J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3d object shape from one depth image. In IEEE CVPR, pages 2484–2493, 2015.
 [32] M. Schmidt. minfunc: unconstrained differentiable multivariate optimization in matlab. URL http://www.di.ens.fr/mschmidt/Software/minFunc.html, 2012.
 [33] R. Schmidt, B. Wyvill, M. C. Sousa, and J. A. Jorge. Shapeshop: Sketch-based solid modeling with blobtrees. In ACM SIGGRAPH 2007 courses, page 43, 2007.
 [34] R. Schnabel. Efficient point-cloud processing with primitive shapes. PhD thesis, University of Bonn, 2010.
 [35] R. Schnabel, P. Degener, and R. Klein. Completion and reconstruction with primitive shapes. In CGF, volume 28, pages 503–512, 2009.
 [36] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, pages 746–760, 2012.
 [37] S. N. Sinha, D. Steedly, R. Szeliski, M. Agrawala, and M. Pollefeys. Interactive 3d architectural modeling from unordered photo collections. In ACM TOG, volume 27, page 159, 2008.
 [38] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, pages 2377–2385, 2015.
 [39] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. arXiv preprint arXiv:1612.00404, 2016.
 [40] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, pages 1747–1756, 2016.
 [41] Y. Wang, S. Asafi, O. van Kaick, H. Zhang, D. Cohen-Or, and B. Chen. Active co-analysis of a set of shapes. ACM Transactions on Graphics (TOG), 31(6):165, 2012.
 [42] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3d interpreter network. In ECCV, pages 365–382, 2016.

 [43] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 [44] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, pages 1696–1704, 2016.
 [45] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, A. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
 [46] R. C. Zeleznik, K. P. Herndon, and J. F. Hughes. Sketch: An interface for sketching 3d scenes. In ACM SIGGRAPH 2007 courses, page 19. ACM, 2007.
 [47] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.
Appendix A in primitive fitting energy
We define in Sec. 3.1 our primitive fitness energy as in Eq. 1. Where is the volumetric sampling ratio that is defined as the volume of the primitive over its number of sampled points :
(13) 
The product is the volume of . In our experiments, we set as a predetermined constant of .
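The ratio described above is simple arithmetic. The following is a minimal sketch, assuming a cuboid primitive whose volume is the product of its three scale components; the function and argument names are hypothetical and do not come from the paper.

```python
import numpy as np


def volumetric_sampling_ratio(scale, num_points):
    """Volume of a cuboid primitive divided by its number of sampled points.

    `scale` holds the three scale components of the primitive (assumed to be
    full side lengths, so that their product is the volume).
    """
    if num_points <= 0:
        raise ValueError("num_points must be positive")
    volume = float(np.prod(np.asarray(scale, dtype=float)))
    return volume / num_points


# Example: a 2 x 1 x 0.5 cuboid (volume 1.0) sampled with 100 points.
ratio = volumetric_sampling_ratio(scale=(2.0, 1.0, 0.5), num_points=100)
```

Holding this ratio fixed, as the text describes, means that larger primitives receive proportionally more sample points.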
Appendix B Derivatives of primitive fitting energy
As stated in Sec. 3.1, Eq. 1 is differentiable and can be solved using robust techniques (L-BFGS [47]). In case , derivatives are analytically defined. Otherwise, is a constant and the derivatives vanish. The derivatives of Eq. 1 with respect to translation , scale , and rotation in the primitive parameter set are:
(14) 
(15) 
where based on Eq. 13 we have:
(16) 
(17) 
is represented in Euler angles, thus is the derivative of rotation matrix with respect to the rotation angles. Further details of the derivatives can be found in [13].
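To illustrate what the derivative of a rotation matrix with respect to Euler angles looks like, here is a minimal sketch. The composition order R = Rz(g) Ry(b) Rx(a) is an assumption made for illustration only; the paper does not state its Euler convention here, and all function names are hypothetical.

```python
import numpy as np


def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])


def d_rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[0, 0, 0], [0, -s, -c], [0, c, -s]])


def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])


def d_rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[-s, 0, c], [0, 0, 0], [-c, 0, -s]])


def rot_z(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])


def d_rot_z(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[-s, -c, 0], [c, -s, 0], [0, 0, 0]])


def rotation(angles):
    """Rotation matrix under the ASSUMED convention R = Rz(g) @ Ry(b) @ Rx(a)."""
    a, b, g = angles
    return rot_z(g) @ rot_y(b) @ rot_x(a)


def rotation_derivatives(angles):
    """Stack of dR/da, dR/db, dR/dg (shape 3 x 3 x 3) via the product rule."""
    a, b, g = angles
    return np.stack([
        rot_z(g) @ rot_y(b) @ d_rot_x(a),
        rot_z(g) @ d_rot_y(b) @ rot_x(a),
        d_rot_z(g) @ rot_y(b) @ rot_x(a),
    ])


angles = np.array([0.3, -0.5, 1.2])
dR = rotation_derivatives(angles)
```

Because each angle appears in exactly one factor, each partial derivative replaces that factor with its elementwise derivative and leaves the other two unchanged. A finite-difference comparison is a convenient sanity check before passing such gradients to an optimizer.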
Appendix C Evaluation on primitive fitting
We evaluate our primitive fitting method on the public semantic region dataset by Yi et al. [45]. The dataset contains detailed per-point labeling of model parts in ShapeNetCore [43], obtained through an efficient semi-supervised labeling approach. We test our method on the test split of the chair category with 704 shapes, which contains part labels for chair seat, back, arm and leg.
We report the common metric of labeling accuracy at the triangle-face level, which measures the fraction of faces with correct labels. We project the ground-truth per-point part labels onto faces through nearest-neighbor search. Given our predicted primitive representation for the voxelized 3D shape, we randomly sample points (we set number of faces) on the shape and assign each point the segment label of the nearest predicted primitive. We then project this per-point segmentation onto mesh faces by majority vote: if points with different labels fall on the same face, the face takes the label held by the majority of those points. Since our primitive representation does not explicitly infer part labels, we automatically relabel each predicted segment with the majority ground-truth label inside it. We achieve an average face labeling accuracy of 0.843. Qualitative results are shown in Fig. 10. We observe that our primitive parsing method is able to decompose shapes into semantically meaningful parts. Lower accuracy is often caused by: 1) a single primitive that fits the shape but covers more than one semantic part (see the shape in the second row, first column); 2) mislabeled regions with a high concentration of faces, e.g. the cylinder-shaped handles of the bottom-left shape contain more faces than the box-shaped chair seat; 3) the failure of our method to parse out slim shape segments (see the bottom-right shape); 4) ground-truth errors introduced by projecting per-point labels onto faces.
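The voting steps of this evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the assignment of each sampled point to its nearest primitive and to a mesh face is taken as given, ties are broken toward the smaller label, and all function and variable names are hypothetical.

```python
import numpy as np


def majority_label(labels, num_classes):
    """Most frequent non-negative integer label (ties go to the smaller label)."""
    return int(np.bincount(labels, minlength=num_classes).argmax())


def points_to_faces(point_labels, point_face_ids, num_faces, num_classes):
    """Per-face label by majority vote over the sampled points on that face.

    Faces that received no sampled point are marked with -1.
    """
    face_labels = np.full(num_faces, -1, dtype=int)
    for f in range(num_faces):
        votes = point_labels[point_face_ids == f]
        if votes.size:
            face_labels[f] = majority_label(votes, num_classes)
    return face_labels


def relabel_by_ground_truth(face_segments, gt_face_labels, num_classes):
    """Give every predicted segment the ground-truth label most common inside it."""
    relabeled = np.full_like(face_segments, -1)
    for seg in np.unique(face_segments):
        if seg < 0:
            continue  # faces without any sampled point stay unlabeled
        mask = face_segments == seg
        relabeled[mask] = majority_label(gt_face_labels[mask], num_classes)
    return relabeled


def face_accuracy(pred_face_labels, gt_face_labels):
    """Fraction of faces whose predicted label matches the ground truth."""
    return float(np.mean(pred_face_labels == gt_face_labels))


# Toy example: 6 sampled points on 3 faces, 2 predicted primitives, 4 part classes.
point_segments = np.array([0, 0, 1, 1, 1, 0])   # nearest-primitive id per point
point_face_ids = np.array([0, 0, 0, 1, 1, 2])   # face each point lies on
gt_face_labels = np.array([2, 3, 3])            # ground-truth part label per face

face_segments = points_to_faces(point_segments, point_face_ids, 3, 2)
pred_labels = relabel_by_ground_truth(face_segments, gt_face_labels, 4)
accuracy = face_accuracy(pred_labels, gt_face_labels)
```

In the toy example, one primitive covers faces with two different ground-truth parts, which reproduces failure case 1) above: the whole segment inherits a single label and one face is counted as wrong.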
Appendix D LSTM sequential prediction order
The recurrent generator of 3D-PRNN described in Sec. 4.1 predicts the parameter set in a predetermined order. At each time step we sample a single instance drawn from the distribution . The sequence represents the parameters in the following order:

Time step 1, for the shape (width) and translation configuration on axis of the 1st primitive and the stopping sign.

Time step 2, for the shape (length) and translation configuration on axis of the 1st primitive and the stopping sign.

Time step 3, for the shape (height) and translation configuration on axis of the 1st primitive and the stopping sign.

Time step 4, for the shape (width) and translation configuration on axis of the 2nd primitive and the stopping sign.

Time step 5, for the shape (length) and translation configuration on axis of the 2nd primitive and the stopping sign.

(sequential prediction of )

Stop when “End of Generation” is predicted.
Note that the above sequence is for the primitive size parameters. We simultaneously predict rotation parameter and rotation axis : at time step 1 we predict the rotation value on axis and a binary signal indicating whether or not there is rotation on axis for the first primitive; at time step 2 we predict and of the first primitive, then and of the first primitive. This simultaneous prediction also stops when “End of Generation” is predicted by the main LSTM prediction sequence outlined above.
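The ordering above amounts to spending three consecutive time steps on each primitive, one per axis. The sketch below shows that bookkeeping only; the axis names x, y, z for width, length and height, the per-step tuple layout, and the choice to drop a primitive that is interrupted by the stopping sign are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

AXES = ("x", "y", "z")  # ASSUMED axis names for width, length, height

# One time step: (size on this axis, translation on this axis, stopping sign).
Step = Tuple[float, float, bool]


@dataclass
class Primitive:
    size: Tuple[float, ...]
    translation: Tuple[float, ...]


def step_to_slot(t: int) -> Tuple[int, str]:
    """Map a 1-based time step to (1-based primitive index, axis name)."""
    return (t - 1) // 3 + 1, AXES[(t - 1) % 3]


def decode(steps: Sequence[Step]) -> List[Primitive]:
    """Group per-axis predictions into primitives until the stopping sign fires."""
    primitives: List[Primitive] = []
    sizes: List[float] = []
    shifts: List[float] = []
    for size, shift, stop in steps:
        if stop:
            break  # "End of Generation": an unfinished primitive is dropped
        sizes.append(size)
        shifts.append(shift)
        if len(sizes) == 3:
            primitives.append(Primitive(tuple(sizes), tuple(shifts)))
            sizes, shifts = [], []
    return primitives


steps = [
    (1.0, 0.0, False), (2.0, 0.1, False), (3.0, 0.2, False),  # 1st primitive
    (0.5, 1.0, False), (0.6, 1.1, False), (0.7, 1.2, False),  # 2nd primitive
    (0.0, 0.0, True),                                         # stop
]
primitives = decode(steps)
```

The rotation branch described in the note would follow the same time-step-to-axis mapping and share the same stopping signal.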
Appendix E Sampling procedure
An unexpected sample that lies far from the mean of the distribution causes errors to accumulate in the subsequent predictions. Therefore, at test time we sample only from the two most probable mixture components at each step, while during training we still sample randomly from all mixture components. This strategy improves the stability of our network on synthetic data. On real data, we found that random sampling among all mixture components at both training and test time can still produce reasonable shapes, because the ground-truth shapes in the real data have simple structures that are easier for the network to model.
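The difference between the two sampling modes can be sketched as below. This is an illustration under assumptions, not the paper's implementation: it uses a one-dimensional Gaussian mixture, and it renormalizes the weights of the retained components before choosing among them, which the text does not specify. All names are hypothetical.

```python
import numpy as np


def sample_mixture(weights, means, stds, rng, top_k=None):
    """Draw one value from a 1-D Gaussian mixture.

    With top_k=None every component can be chosen (training-time behavior).
    With top_k=2 only the two most probable components are candidates
    (test-time behavior); their weights are renormalized, which is an
    assumption of this sketch.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)

    if top_k is None:
        keep = np.arange(weights.size)
    else:
        keep = np.argsort(weights)[::-1][:top_k]  # indices of the largest weights

    probs = weights[keep] / weights[keep].sum()
    component = int(rng.choice(keep, p=probs))
    value = float(rng.normal(means[component], stds[component]))
    return value, component


rng = np.random.default_rng(0)
weights = [0.5, 0.3, 0.15, 0.05]
means = [0.0, 1.0, 5.0, -5.0]
stds = [0.1, 0.1, 0.1, 0.1]

value, component = sample_mixture(weights, means, stds, rng, top_k=2)
```

Restricting the candidates removes the low-probability components whose samples lie far from the bulk of the distribution, which is the source of the accumulated error described above.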
Appendix F Additional Results
F.1 Synthetic data
Additional qualitative results of shape reconstruction from a single depth view for synthetic data are shown in Fig. 11.
F.2 Real data
Additional qualitative results of shape reconstruction from a single depth view for real data are shown in Fig. 12.
Appendix G Application: shape segmentation
Our primitive-based reconstructions naturally align with semantic part configurations, and thus are directly applicable to shape segmentation tasks. To demonstrate this, we assume an input 3D shape is fully observed and use 3D-PRNN to reconstruct it as a collection of primitives. We then use the resulting primitives to semantically segment the original input shape.
Volumetric Encoder. The input 3D shape is represented as a binary voxel grid. We revise our previous depth-image-based encoder network to handle such voxelized input. Specifically, the first layer has kernels of size and stride , with a LeakyReLU layer of slope in the negative part. The second layer consists of kernels of size (stride ), followed by the same LeakyReLU setting and a max pooling layer of regions. The third layer has kernels of size (stride ) followed by similar LeakyReLU and max pooling layers. The next two fully-connected layers have and neurons respectively. The output feature vector of dimension is then fed to the recurrent generator to predict a sequence of primitives. Note that the decoder and LSTM parts of the network remain the same.
Evaluation. We evaluate the performance of 3D-PRNN for the shape segmentation task on the COSEG dataset [41]. Since the dataset has no ground-truth primitive representation, for each shape we automatically extract the tightest oriented box corresponding to each labeled segment and use it as a ground-truth primitive. Primitives are ordered by the height of each box center, in decreasing order. Similar to the training scheme in the single depth reconstruction case, we first train our network on the random split of of the data, and validate it on of the data to choose the required number of training epochs. We then train on this of the data and test on the remaining of the data, which has never been seen by the network.
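The ground-truth generation step can be sketched as follows. Note the assumptions: the paper does not say how the tightest oriented box is computed, so this sketch substitutes a PCA-aligned bounding box, which is a common approximation and not guaranteed to be the tightest one; the up axis is assumed to be z; and all names are hypothetical.

```python
import numpy as np


def pca_oriented_box(points):
    """PCA-aligned bounding box of a point set (needs at least 3 points).

    Returns (center, size, axes), where the rows of `axes` are the box
    directions. This approximates, but may not equal, the tightest box.
    """
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    # Rows of vt are the principal directions of the centered points.
    _, _, axes = np.linalg.svd(points - mean, full_matrices=False)
    local = (points - mean) @ axes.T          # coordinates in the box frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    center = mean + ((lo + hi) / 2.0) @ axes  # box center back in world frame
    return center, hi - lo, axes


def boxes_for_segments(points, labels, up_axis=2):
    """One box per labeled segment, ordered by decreasing box-center height."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    boxes = [pca_oriented_box(points[labels == l]) for l in np.unique(labels)]
    boxes.sort(key=lambda box: box[0][up_axis], reverse=True)
    return boxes


def cuboid_corners(center, size):
    """Eight corners of an axis-aligned cuboid, used only to build a toy input."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(center, dtype=float) + 0.5 * signs * np.asarray(size, dtype=float)


# Toy chair: a flat seat at height 0 (label 0) and a tall back at height 1 (label 1).
seat = cuboid_corners(center=(0.0, 0.0, 0.0), size=(2.0, 1.0, 0.2))
back = cuboid_corners(center=(0.0, 0.0, 1.0), size=(0.2, 1.0, 2.0))
points = np.vstack([seat, back])
labels = np.array([0] * 8 + [1] * 8)

boxes = boxes_for_segments(points, labels)
```

Sorting by box-center height gives every shape a deterministic primitive sequence, which the recurrent generator needs as a training target.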
Since the largest class of objects in COSEG is the chair category with 400 shapes, we test on this category. However, this is still too small a set for training an RNN. Hence, we first pre-train our network on the ModelNet chair class with 889 shapes (with a validation split), then fine-tune it on the COSEG chairs training set. This fine-tuning strategy increases our segmentation accuracy by . We use Adam [23] to update network parameters with a learning rate of , , and batch size for training and for fine-tuning.
Results. We use the same metric as in Appendix C. We compare the segmentation obtained from the most probable generation result of 3D-PRNN with the template-based segmentation result of Kim et al. [22], a state-of-the-art method that also fits oriented boxes to shapes for segmentation. We provide a quantitative comparison in Table 5. Note that 3D-PRNN sometimes fails to predict some of the parts, which numerically lowers our performance. Thus, we report both our overall performance and the average performance excluding such unpredicted parts. We also report the accuracy of the simple approach we used to generate the ground-truth primitive representations for training, as this provides an upper bound for our method. In cases where 3D-PRNN predicts the correct number of primitives, it outperforms the method of Kim et al. [22].
GT     |  Kim et al.  |  3D-PRNN
0.896  |  0.829       |  0.796
       |  0.836       |  0.859