1 Introduction
Humans use a hammer by holding its handle and striking with its head, not vice versa. In this simple action, people demonstrate their understanding of functional parts [37, 43]: a tool, or any object, can be decomposed into primitive-based components, each with distinct physics, functionality, and affordances [19].
How can we build a machine with such competency? In this paper, we tackle the problem of physical primitive decomposition (PPD)—explaining both the shape and the physics of an object with a few shape primitives carrying physical parameters. Given the hammer in Figure 1, our goal is to build a model that recovers its two major components: a tall, wooden cylinder for its handle, and a smaller, metal cylinder for its head.
For this task, we need a physical, part-based object shape representation that models both object geometry and physics. Ground-truth annotations for such representations are, however, challenging to obtain: large-scale shape repositories like ShapeNet [8] often have limited annotations on object parts, let alone physics. This is mostly due to two reasons. First, annotating object parts and physics is labor-intensive and requires strong domain expertise, neither of which can be offered by current crowdsourcing platforms. Second, there exists intrinsic ambiguity in the ground truth: it is impossible to precisely label underlying physical object properties like density from only images or videos.
Let’s think more about what these representations are for. We want our object representation to faithfully encode its geometry; therefore, it should be able to explain our visual observation of the object’s appearance. Further, as the representation models object physics, it should be effective in explaining the object’s behaviors in various physical events.
Inspired by this, we propose a novel formulation that learns a part-based object representation from both visual observations and physical interactions. Starting with a single image and a voxelized shape, the model recovers the geometric primitives and infers their physical properties from texture. The physical representation inferred this way is of course rather uncertain; it therefore serves only as the model's prior over the physical shape. Observing object behaviors in physical events offers crucial additional information, as objects with different physical properties behave differently in physical events. The model combines this evidence with the prior to produce its final prediction.
We evaluate our system for physical primitive decomposition in three scenarios. First, we generate a dataset of synthetic block towers, where each block has distinct geometry and physics. Our model is able to successfully reconstruct the physical primitives by making use of both appearance and motion cues. Second, we evaluate the system on a set of synthetic tools, demonstrating its applicability to daily-life shapes. Third, we build a new dataset of real block towers in dynamic scenes, and evaluate the model's generalization power to real videos.
We further present ablation studies to understand how each source of information contributes to the final performance. We also conduct human behavioral experiments to contrast the performance of the model with humans. In a ‘which block is heavier’ experiment, our model performs comparably to humans.
Our contributions in this paper are threefold. First, we propose the problem of physical primitive decomposition—learning a compact, disentangled object representation in terms of physical primitives. Second, we present a novel learning paradigm that learns to characterize shapes in physical primitives to explain both their geometry and physics. Third, we demonstrate that our system can achieve good performance on both synthetic and real data.
2 Related Work
Primitive-Based 3D Representations. Early attempts at modeling 3D shapes with primitives include decomposing them into blocks [38], generalized cylinders [6], and geons [5]. This idea has been constantly revisited throughout the development of computer vision [12, 14, 2]. To name a few, Gupta et al. [12] modeled scenes as qualitative blocks, and van den Hengel et al. [14] as Lego blocks. More recently, Tulsiani et al. [44] combined the new and the old, using a deep convolutional network to generate primitives of a given 3D shape; later, Zou et al. proposed 3D-PRNN [57], enhancing the flexibility of the system by leveraging modern advances in recurrent generative models [45].

Primitive-based representations have profound impact that goes far beyond the field of computer vision. Scientists have employed this representation for user-interactive design [17] and for teaching robots to grasp objects [33]. In the field of computer graphics, the idea of modeling shapes as primitives or parts has also been extensively explored [54, 51, 30, 21, 23, 2]. Researchers have used part-based representations for single-image shape reconstruction [16], shape completion [41], and probabilistic shape synthesis [15, 28].
Physical Shape and Scene Modeling. Beyond object geometry, there has been growing interest in modeling physical object properties and scene dynamics. The computer vision community has put major effort into building rich and sizable databases. ShapeNetSem [40] is a collection of object shapes with material and physics annotations within the web-scale shape repository ShapeNet [8]. The Materials in Context Database (MINC) [4] is a gigantic dataset of materials in the wild, associating patches in real-world images with 23 materials.
Research on physical object modeling dates back to the study of “functional parts” [37, 43, 19]. The field of learning object physics and scene dynamics has prospered in the past few years [26, 1, 20, 3, 52, 34, 36, 7, 42, 22, 29]. Among them, there are a few papers that explicitly build physical object representations [34, 47, 49, 48, 53]. Though they also focus on understanding object physics [47, 49], functionality [55, 50], and affordances [25, 11, 56], these approaches usually assume a homogeneous object with simple geometry. In our paper, we model an object using physical primitives for richer expressiveness and higher precision.
3 Physical Primitive Decomposition
3.1 Problem Statement
Both primitive decomposition and physical primitive decomposition attempt to approximate an object with primitives. We highlight their difference in Figure 2.
Primitive Decomposition. As formulated in Tulsiani et al. [44] and Zou et al. [57], primitive decomposition aims to decompose an object $o$ into a set of simple transformed primitives $x = \{x_k\}$ so that these primitives accurately approximate its geometric shape. This task can be posed as minimizing

(1)  $\mathcal{L}_S(x) = D_S\big(S(x), S(o)\big)$,

where $S(\cdot)$ denotes the geometric shape of an object (i.e., its point cloud), and $D_S(\cdot, \cdot)$ denotes a distance metric between shapes (i.e., the earth mover's distance [39]).
Physical Primitive Decomposition. In order to understand the functionality of object parts, we require the decomposed primitives $x$ to also approximate the physical behavior of the object $o$. To this end, we extend the previous objective function with an additional physics term:

(2)  $\mathcal{L}_P(x) = \sum_{p \in \mathcal{P}} D_T\big(T_p(x), T_p(o)\big)$,

where $T_p(\cdot)$ denotes the trajectory after physics interaction $p$, $D_T(\cdot, \cdot)$ denotes a distance metric between trajectories (i.e., mean squared error), and $\mathcal{P}$ denotes a predefined set of physics interactions. Therefore, the task of physical primitive decomposition is to minimize an overall objective function constraining both geometry and physics: $\mathcal{L}(x) = \mathcal{L}_S(x) + w \cdot \mathcal{L}_P(x)$, where $\mathcal{L}_S$ is the shape term from Equation 1 and $w$ is a weighting factor.
Figure 2: (a) Iron and Wood; (b) Two Coppers.
3.2 PrimitiveBased Representation
We design a structured primitive-based object representation, which describes an object by listing all of its primitives with their attributes. For each primitive $x_k$, we record its size $s_k$, its position in 3D space $p_k$, and its rotation in quaternion form $q_k$. Apart from this geometric information, we also track its physical property: density $d_k$.

In our object representation, the shape parameters $s_k$, $p_k$, and $q_k$ are vectors of continuous real values, whereas the density parameter $d_k$ is a discrete value. We discretize the density values into $N_d$ slots, so that estimating density becomes an $N_d$-way classification. Discretization helps to deal with multi-modal density values. Figure 2(a) shows that two parts with similar visual appearance may have very different physical parameters. In such cases, regression with an $\ell_2$ loss will encourage the model to predict the mean of the possible densities; in contrast, discretization allows it to assign high probability to every possible density. We then determine which candidate value is optimal from the trajectories.
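As an illustration, the discretization step can be sketched as follows. The slot values here are hypothetical stand-ins; the actual candidate densities depend on the material table used in the experiments.

```python
import numpy as np

# Hypothetical density slots (g/cm^3); the paper's actual candidate
# values are dataset-specific and not reproduced here.
DENSITY_SLOTS = np.array([0.4, 0.8, 1.2, 2.0, 2.7, 7.9, 8.9])

def density_to_class(density: float) -> int:
    """Map a continuous density to the index of the nearest slot."""
    return int(np.argmin(np.abs(DENSITY_SLOTS - density)))

def class_to_density(idx: int) -> float:
    """Recover the candidate density value for a class index."""
    return float(DENSITY_SLOTS[idx])
```

With this formulation, a part whose density is ambiguous between two modes can receive high probability on both candidate classes, instead of regressing toward their physically meaningless mean.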
4 Approach
In this section, we discuss our approach to the problem of physical primitive decomposition (PPD). We present an overview of our framework in Figure 4.
4.1 Overview
Inferring physical parameters from visual or physical observation alone is highly challenging. This is because two objects with different physical parameters might have similar visual appearance (Figure 2(a)) or similar physics trajectories (Figure 2(b)). Therefore, our model takes both types of observations as input:

Visual Observation. We take a voxelized shape and an image as our input because they provide valuable visual information. Voxels help us recover object geometry, and images contain texture information about object materials. Note that, even with voxels as input, it is still highly non-trivial to infer geometric parameters: the model needs to learn to segment 3D parts within the object — an unsolved problem by itself [44].

Physics Observation. In order to explain the physical behavior of an object, we also need to observe its response to some physics interactions. In this work, we choose to use 3D object trajectories rather than RGB (or RGB-D) videos. Their abstractness enables the model to transfer better from synthetic to real data, because synthetic and real videos can be starkly different; in contrast, it is easy to generate synthetic 3D trajectories that look realistic.
Specifically, our network takes a voxelized shape $V$, an image $I$, and object trajectories $\mathcal{T}$ as input. $V$ is a 3D binary voxel grid, $I$ is a single RGB image, and $\mathcal{T}$ consists of several object trajectories $T_p$, each of which records the response to one specific physics interaction $p$. A trajectory $T_p$ is a sequence of 3D object poses $(c_t, q_t)$, where $c_t$ denotes the object's center position and the quaternion $q_t$ denotes its rotation at each time step $t$.
After receiving the inputs, our network encodes the voxels, image, and trajectories with separate encoders, and sequentially predicts primitives using a recurrent primitive generator. For each primitive, the network predicts its geometric shape (i.e., scale, translation, and rotation) and physical property (i.e., density). More details of our model can be found in the supplementary material.
Voxel Encoder. For the input voxel grid $V$, we employ a 3D volumetric convolutional network to encode the 3D shape information into a voxel feature $f_V$.
Image Encoder. For the input image $I$, we pass it through a ResNet-18 [13] encoder to obtain an image feature $f_I$. We refer the readers to He et al. [13] for details.
Trajectory Encoder. For the input trajectories $\mathcal{T}$, we encode each trajectory $T_p$ into a low-dimensional feature vector with a separate bidirectional recurrent neural network. Specifically, we feed the trajectory sequence, and the same sequence in reverse order, into two encoding RNNs to obtain two final hidden states, $h_{\text{fwd}}$ and $h_{\text{bwd}}$; we take their concatenation as the feature vector of $T_p$. Finally, we concatenate the features of all trajectories and project them into a low-dimensional trajectory feature $f_T$ with a fully-connected layer.

Primitive Generator. We concatenate the voxel feature $f_V$, image feature $f_I$, and trajectory feature $f_T$ together, and map the result to a low-dimensional feature $f$ using a fully-connected layer. We predict the set of physical primitives sequentially with a recurrent generator.
At each time step $k$, we feed the previously generated primitive $x_{k-1}$ and the feature vector $f$ in as input, and we receive one hidden vector $h_k$ as output. Then, we compute the new primitive $x_k = (s_k, p_k, q_k, d_k)$ as

(3)  $s_k = \beta_s \cdot \mathrm{sigmoid}(W_s h_k + b_s)$, $\quad p_k = \beta_p \cdot \tanh(W_p h_k + b_p)$, $\quad q_k = \dfrac{W_q h_k + b_q}{\|W_q h_k + b_q\|_2 + \epsilon}$, $\quad d_k = \mathrm{softmax}(W_d h_k + b_d)$,

where $\beta_s$ and $\beta_p$ are scaling factors, and $\epsilon$ is a small constant for numerical stability. Equation 3 guarantees that $s_k$ is in the range of $(0, \beta_s)$, $p_k$ is in the range of $(-\beta_p, \beta_p)$, and $\|q_k\|_2 = 1$ (if ignoring $\epsilon$), which ensures that $x_k$ will always be a valid primitive. In our experiments, we set $\beta_s = \beta_p = 1$, since we normalize all objects so that they fit in unit cubes. Also note that $d_k$ is an $(N_d + 2)$-dimensional vector, where the first $N_d$ dimensions indicate different density values and the last two indicate the "start token" and "end token".
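A minimal sketch of these output heads, assuming a hidden vector from the recurrent generator; the weight matrices here are random stand-ins, and the hidden size and number of density slots are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H, ND = 64, 7                    # assumed hidden size and density-slot count
h_k = rng.normal(size=H)         # stand-in for the generator's hidden vector

def head(out_dim):
    """A linear head W h_k + b with random stand-in weights."""
    W, b = rng.normal(size=(out_dim, H)) * 0.1, np.zeros(out_dim)
    return W @ h_k + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

beta_s, beta_p, eps = 1.0, 1.0, 1e-8
s_k = beta_s * sigmoid(head(3))               # size in (0, beta_s)
p_k = beta_p * np.tanh(head(3))               # position in (-beta_p, beta_p)
raw_q = head(4)
q_k = raw_q / (np.linalg.norm(raw_q) + eps)   # (approximately) unit quaternion
d_k = softmax(head(ND + 2))                   # density classes + start/end tokens
```

The squashing functions are what enforce the validity constraints: sigmoid and tanh bound the size and position, normalization yields a unit quaternion, and softmax yields a proper distribution over density classes.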
Sampling and Simulating with the Physics Engine. During testing, we treat the predicted $d_k$ as a multinomial distribution and sample multiple candidate predictions from it. For each sample, we use its physical parameters to simulate the trajectory with a physics engine. Finally, we select the candidate whose simulated trajectory is closest to the observed trajectory.
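The sample-simulate-select loop can be sketched as follows, with a toy stand-in for the physics engine and hypothetical candidate densities; only the selection logic is the point here.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(density, steps=20):
    # Toy stand-in for the physics engine: under the same impulse,
    # a heavier object moves less at every time step.
    t = np.arange(steps, dtype=float)
    return (1.0 / density) * t

true_density = 2.0
observed = simulate(true_density)            # the observed trajectory

d_probs = np.array([0.1, 0.3, 0.4, 0.2])     # predicted distribution d_k
candidates = np.array([0.5, 1.0, 2.0, 4.0])  # hypothetical density slots

# Draw samples from the predicted distribution, simulate each candidate,
# and keep the one whose trajectory best matches the observation.
samples = rng.choice(len(candidates), size=64, p=d_probs)
errors = [np.abs(simulate(candidates[i]) - observed).mean() for i in samples]
best = candidates[samples[int(np.argmin(errors))]]
```

Because the network's predicted distribution concentrates the sampling on plausible densities, only a small number of simulations is needed in practice.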
An alternative way to incorporate the physics engine is to directly optimize our model over it. As most physics engines are not differentiable, we employ REINFORCE [46] for optimization. Empirically, we observe that this reinforcement learning based method performs worse than the sampling-based method, possibly due to the large variance of the approximate gradient signals.
Simulating with a physics engine requires knowing the force applied during testing. Such an assumption is essential to ensure the problem is well-posed: without knowing the force, we can only infer the relative densities of parts, not their actual values. Note that in many real-world applications such as robot manipulation, the external force is indeed available.
4.2 Loss Functions
Let $x = \{x_k\}$ and $\hat{x} = \{\hat{x}_k\}$ be the predicted and ground-truth physical primitives, respectively, with $x_k = (s_k, p_k, q_k, d_k)$. Our loss function consists of two terms, a geometry loss $\mathcal{L}_G$ and a physics loss $\mathcal{L}_P$:

(4)  $\mathcal{L}_G = \sum_k \big( w_s \cdot \|s_k - \hat{s}_k\|_2^2 + w_p \cdot \|p_k - \hat{p}_k\|_2^2 + w_q \cdot \|q_k - \hat{q}_k\|_2^2 \big)$,

(5)  $\mathcal{L}_P = -\sum_k \log d_k[\hat{d}_k]$,

where $w_s$, $w_p$, and $w_q$ are weighting factors, which are set to 1's because $s$, $p$, and $q$ are of the same magnitude in our datasets, and $d_k[\hat{d}_k]$ denotes the predicted probability of the ground-truth density class. Integrating Equation 4 and Equation 5, we define the overall loss function as $\mathcal{L} = \mathcal{L}_G + w \cdot \mathcal{L}_P$, where $w$ is set to ensure that $\mathcal{L}_G$ and $w \cdot \mathcal{L}_P$ are of the same magnitude.
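A sketch of this loss, assuming squared error on the geometric attributes and cross-entropy on the discretized density (matching the classification formulation in Section 3.2); the dict-based primitive layout is an illustrative assumption.

```python
import numpy as np

def geometry_loss(pred, gt, w_s=1.0, w_p=1.0, w_q=1.0):
    """Weighted squared error over size s, position p, and rotation q."""
    return (w_s * np.sum((pred["s"] - gt["s"]) ** 2)
            + w_p * np.sum((pred["p"] - gt["p"]) ** 2)
            + w_q * np.sum((pred["q"] - gt["q"]) ** 2))

def physics_loss(d_pred, d_gt_idx):
    """Cross-entropy against the ground-truth density class."""
    return -np.log(d_pred[d_gt_idx] + 1e-12)

def total_loss(pred, gt, d_gt_idx, w=1.0):
    """Overall loss: geometry term plus weighted physics term."""
    return geometry_loss(pred, gt) + w * physics_loss(pred["d"], d_gt_idx)
```

In practice the weighting factor `w` would be chosen, as the text describes, so that the two terms are of comparable magnitude on the training data.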
Part Associations. In our formulation, object parts (physical primitives) follow a predefined order (e.g., from bottom to top), and our model is encouraged to learn to predict the primitives in the same order.
5 Experiments
We evaluate our PPD model in three diverse settings: synthetic block towers, where blocks have various materials and shapes; synthetic tools with more complex geometry; and real videos of block towers, to demonstrate transferability to real-world scenarios.
5.1 Decomposing Block Towers
We start with decomposing block towers (stacks of blocks).
Block Towers. We build the block towers by stacking a variable number of blocks (2–5 in our experiments) together. We first sample the size of each block and then compute the center positions of the blocks from bottom to top: the horizontal center of each block is sampled from a normal distribution centered at the block below, and the vertical center is set so that each block rests on top of the previous one. We illustrate some constructed block towers in Figure 5. We perform exact voxelization with a grid size of 32 × 32 × 32 using binvox, a 3D mesh voxelizer [35].

Table 1: The five materials (Wood, Brick, Stone, Ceramic, and Metal) and the ranges of their densities.
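The tower construction above can be sketched as follows; the size range and horizontal drift scale are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_tower(n_blocks):
    """Stack blocks bottom-to-top: horizontal centers drift around the
    block below, vertical centers rest exactly on the tower top."""
    blocks = []
    cx, cy, top = 0.0, 0.0, 0.0
    for _ in range(n_blocks):
        sx, sy, sz = rng.uniform(0.1, 0.3, size=3)  # block size (assumed range)
        cx = rng.normal(cx, 0.02)                   # small horizontal drift
        cy = rng.normal(cy, 0.02)
        cz = top + sz / 2.0                         # rest on the current top
        top += sz
        blocks.append(dict(size=(sx, sy, sz), center=(cx, cy, cz)))
    return blocks
```

Each generated tower is physically plausible by construction: every block's bottom face coincides with the top face of the block below it.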
Materials. In our experiments, we use five different materials, and follow their realworld densities with minor modifications. The materials and the ranges of their densities are listed in Table 1. For each block in the block towers, we first assign it to one of the five materials, and then uniformly sample its density from possible values of its material. We generate 8 configurations for each block tower.
Textures. We obtain the textures for materials by cropping the center portion of images from the MINC dataset [4]. We show sample images rendered with material textures in Figure 5. Since we render the textures only with respect to the material, the images rendered do not provide any information about density.
Physics Interactions. We place the block towers at the origin and perform four physics interactions to obtain the object trajectories. In detail, we exert a force of fixed magnitude on the block tower from each of four predefined positions. We simulate each physics interaction for 256 time steps using the Bullet Physics Engine [9]. To ensure simulation accuracy, we use a small simulation time step.
Table 2: Results on block towers. "Texture" and "Physics" indicate which observations each method uses; Top-1/Top-5/Top-10 are density accuracies, RMSE is density error, and MAE is trajectory error.

Methods               Texture  Physics  Top-1  Top-5  Top-10  RMSE  MAE
Frequent              –        –        2.0    9.7    13.4    25.4  74.4
Nearest               –        +        1.9    7.9    12.4    41.1  91.0
Oracle                +        –        6.9    35.7   72.0    18.5  51.3
PPD (no trajectory)   +        –        7.2    35.2   69.5    19.0  51.7
PPD (no image)        –        +        7.1    31.0   50.8    16.7  36.4
PPD (no voxels)       +        +        15.9   56.3   82.4    10.3  29.9
PPD (RGB-D)           +        +        11.6   50.5   79.5    12.8  30.2
PPD (full)            +        +        16.1   56.4   82.5    9.9   21.0
PPD (full) + Sample   +        +        18.2   59.7   84.0    8.8   13.9
Metrics. We evaluate the performance of shape reconstruction by the F1 score between the prediction and ground truth: each predicted primitive is labeled a true positive if its intersection over union (IoU) with a ground-truth primitive is greater than 0.5. For physics estimation, we employ two types of metrics: i) density measures: top-k accuracy and root-mean-square error (RMSE), and ii) a trajectory measure: mean absolute error (MAE) between the trajectory simulated using the predicted physical parameters and the ground-truth trajectory.
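The IoU-based F1 computation can be sketched as follows, simplified to axis-aligned boxes and a matching that scores each predicted box independently against the ground truth.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (min_corner, max_corner)."""
    lo = np.maximum(a[0], b[0])
    hi = np.minimum(a[1], b[1])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(box[1] - box[0])
    return inter / (vol(a) + vol(b) - inter)

def f1_score(preds, gts, thresh=0.5):
    """A prediction counts as a true positive if its IoU with some
    ground-truth box exceeds the threshold."""
    tp = sum(any(iou(p, g) > thresh for g in gts) for p in preds)
    precision = tp / max(len(preds), 1)
    recall = tp / max(len(gts), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```

A stricter evaluation would use one-to-one matching between predictions and ground truth; the independent matching above is a simplification for illustration.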
Methods. We evaluate our model with different combinations of observations as input: i) texture only (no trajectory input), ii) physics only (no image input), iii) both texture and physics but without the voxelized shape, iv) both texture and physics but with the 3D trajectory replaced by a raw depth video, and v) full data in our original setup (image, voxels, and trajectory). We also compare our model with several baselines: i) predicting the most frequent density in the training set (Frequent), ii) nearest-neighbor retrieval from the training set (Nearest), and iii) knowing the ground-truth material and guessing within its density range (Oracle). While all these baselines assume perfect shape reconstruction, our model learns to decompose the shape.
Results. For shape reconstruction, our model achieves an F1 score of 97.5. For physics estimation, we present quantitative results of our model with different observations as input in Table 2. We compare our model with an oracle that infers material properties from appearance while assuming ground-truth reconstruction; it gives the upper-bound performance of methods that rely only on appearance cues. Experiments suggest that appearance alone is not sufficient for density estimation. From Table 2, we observe that combining appearance with physics performs well on physical parameter estimation, because the object trajectories provide crucial additional information about the density distribution (i.e., moments of inertia). Also, all input modalities and sampling contribute to the model's final performance.
Table 3: Trajectory MAE of sampling-based baselines and our model as the number of samples increases.

# Samples             1      8      64     512
Sample Phys. + Shape  142.2  87.1   70.8   58.7
Sample Phys.          89.7   60.1   38.7   22.7
PPD (ours)            21.0   15.1   13.9   13.2
We have also implemented a physics engine based sampling baseline: sampling the shape and physical parameters for each primitive, simulating with a physics engine, and selecting the sample whose trajectory is closest to the observation. We also compare with a stronger baseline that only samples physics, assuming the ground-truth shape is known. Table 3 shows that our model works better and is more efficient: the neural networks have learned an informative prior that greatly reduces the need for sampling at test time.
5.2 Decomposing Tools
We then demonstrate the practical applicability of our model by decomposing synthetic realworld tools.
Tools. Because of the absence of tool data in the ShapeNet Core [8] dataset, we download tools from 3D Warehouse (https://3dwarehouse.sketchup.com) and manually remove all unrelated models. In total, there are 204 valid tools, and we use Blender to remesh and clean them up, fixing issues with missing faces and normals. Following Chang et al. [8], we perform PCA on the point clouds and align models by their PCA axes. Sample tools in our dataset are shown in Figure 6.
Table 4: Results on synthetic tools. Columns are as in Table 2: Top-1/Top-5/Top-10 are density accuracies, RMSE is density error, and MAE is trajectory error.

Methods               Texture  Physics  Top-1  Top-5  Top-10  RMSE  MAE
Frequent              –        –        2.5    10.2   13.6    25.9  348.2
Nearest               –        +        2.9    8.3    12.4    25.8  329.7
Oracle                +        –        7.4    35.2   72.0    19.1  185.8
PPD (no trajectory)   +        –        7.7    36.4   71.1    16.8  206.8
PPD (no image)        –        +        15.0   56.3   80.2    5.9   143.6
PPD (full)            +        +        35.7   85.2   95.8    2.6   103.6
PPD (full) + Sample   +        +        38.3   85.0   96.1    2.5   74.4
Primitives. Similar to Zou et al. [57], we first use energy-based optimization to fit primitives to the point clouds; then, we assign each vertex to its nearest primitive and refine each primitive as the minimum oriented bounding box of the vertices assigned to it.
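The assignment-and-refit step can be sketched as follows, simplified from minimum oriented bounding boxes to axis-aligned boxes for brevity.

```python
import numpy as np

def refine_primitives(vertices, centers):
    """Assign each vertex to the nearest primitive center, then refit each
    primitive as the axis-aligned bounding box of its assigned vertices
    (the paper uses minimum *oriented* boxes; this is a simplification)."""
    vertices = np.asarray(vertices, dtype=float)
    centers = np.asarray(centers, dtype=float)
    assign = np.argmin(
        np.linalg.norm(vertices[:, None] - centers[None, :], axis=-1), axis=1)
    boxes = []
    for k in range(len(centers)):
        pts = vertices[assign == k]
        boxes.append((pts.min(axis=0), pts.max(axis=0)) if len(pts) else None)
    return assign, boxes
```

The oriented-box variant additionally optimizes a rotation per primitive before taking bounds, which is what introduces the rotational ambiguity discussed in the results below.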
Other Setups. We make use of the same set of materials and densities as in Table 1 and the same textures for materials as described in Section 5.1. Sample images rendered with textures are shown in Figure 6. As for physics interactions, we follow the same scenario configurations as in Section 5.1.
Training Details. Because the size of the synthetic tools dataset is rather limited, we first pretrain our PPD model on the block towers and then fine-tune it on the synthetic tools. For the block towers used in pretraining, we fix the number of blocks to 2 and introduce small random noise and rotations to each block, to bridge the gap between block towers and synthetic tools.
Results. For shape reconstruction, our model achieves an F1 score of 85.9. For physics estimation, we present quantitative results in Table 4. The shape reconstruction is not as good as on the block towers dataset because the synthetic tools are more complicated, and orientation may introduce ambiguity (there can exist multiple bounding boxes with different rotations for the same object part). The physics estimation performance is better because the number of primitives in our synthetic tools dataset is small (generally 2). We also show some qualitative results in Figure 6.
5.3 Decomposing Real Objects
We look into real objects to evaluate the generalization ability of our model.
Real-World Block Towers. We purchase ten sets of blocks of different materials (pine, steel, aluminum, and copper) from Amazon and construct a dataset of real-world block towers. Our dataset contains 16 block towers with different configurations: 8 with two blocks, 4 with three blocks, and 4 with four blocks.
Physics Interaction. The scenario is set up as follows: the block tower is placed at a specific position on the desk, and we use a copper ball (hung as a pendulum) to hit it. In Figure 7, we show some objects and their trajectories in our dataset.
Video to 3D Trajectory. On real-world data, a major challenge is converting the RGB videos into 3D trajectories. We employ the following two-step approach:

Tracking 2D Keypoints. For each frame, we first detect the 2D positions of the object corners. For simplicity, we mark the object corners with red stickers and use a simple color filter to determine the corner positions. Then, we find the correspondence between corner points from consecutive frames by solving a minimum-distance matching between the two sets of points. After aligning the corner points across frames, we obtain the 2D trajectories of these keypoints.
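The frame-to-frame correspondence step can be sketched as follows, using a greedy nearest-neighbor matching as a simple stand-in for an exact minimum-distance assignment.

```python
import numpy as np

def match_corners(prev_pts, next_pts):
    """Match each corner in frame t to the closest unmatched corner in
    frame t+1, processing the most confident (closest) corners first.
    A greedy approximation of minimum-distance matching."""
    prev_pts = np.asarray(prev_pts, dtype=float)
    next_pts = np.asarray(next_pts, dtype=float)
    dists = np.linalg.norm(prev_pts[:, None] - next_pts[None, :], axis=-1)
    matches, used = {}, set()
    for i in np.argsort(dists.min(axis=1)):          # most confident first
        order = np.argsort(dists[i])
        j = next(int(j) for j in order if j not in used)
        matches[int(i)] = j
        used.add(j)
    return matches
```

This works well when inter-frame motion is small relative to the spacing between corners; an exact assignment (e.g., the Hungarian algorithm) would be the robust alternative for larger motions.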

Reconstructing 3D Poses. We annotate the 3D position of each corner point. Then, for each frame, we have the 2D locations of the keypoints and their corresponding 3D locations. Finally, we reconstruct the 3D object pose in each frame by solving the Perspective-n-Point problem between the 2D and 3D locations with the Levenberg-Marquardt algorithm [27, 32].






Training Details. We build a virtual physics environment, similar to our realworld setup, in the Bullet Physics Engine [9]. We employ it to simulate physics interactions and generate a dataset of synthetic block towers to train our model.
Results. We show some qualitative results of our model with different observations as input in Figure 8. In the real-world setup, with only texture or only physics information, our model cannot effectively predict the physical parameters, because real images and object trajectories are much noisier than their synthetic counterparts; combining the two indeed yields much more accurate predictions. In terms of quantitative evaluation, our model (with both observations as input) achieves an RMSE of 18.7 over the whole dataset and 10.1 over the block towers with two blocks (the RMSE of random guessing is 40.8).
6 Analysis
To better understand our model, we present several analyses. The first three are conducted on synthetic block towers and the last one on our real dataset.
Learning Speed with Different Supervisions. We show the learning curves of our PPD model with different supervision in Figure 9. The model supervised by physics observation reaches the same level of performance as the model with texture supervision using far fewer training steps (500K vs. 2M). Supervised by both observations, our PPD model preserves the learning speed of the physics-only model and further improves its performance.
Preference over Possible Values. We illustrate the confusion matrices of physical parameter estimation in Figure 10. Although our PPD model performs similarly with only texture as input and with only physics as input, its preferences over the possible values turn out to be quite different. With texture as input (Figure 10(a)), it tends to guess within the possible values of the corresponding material (see Table 1), while with physics as input (Figure 10(b)), it only makes errors between very close values. Therefore, the information provided by the two types of inputs is orthogonal (Figure 10(c)).
Impact of Primitive Numbers. As demonstrated in Table 5, the number of blocks has nearly no influence on the model with texture as input. With physics interactions as input, the model performs much better on fewer blocks, and its performance degrades as the number of blocks increases. The degradation is probably because the physical response of a rigid body is fully characterized by a few aggregate properties (i.e., total mass, center of mass, and moment of inertia), which provide only limited constraints on the density distribution of an object when the number of primitives is relatively large.
Table 5: Physical parameter estimation with different numbers of blocks.

Observation        2 blocks  3 blocks  4 blocks  5 blocks  Overall
Texture            18.2      18.5      18.8      19.7      19.1
Physics            3.6       7.9       15.8      20.0      14.7
Texture + Physics  2.3       4.9       7.8       10.9      8.0
Human Studies. We select the block towers with two blocks from our real dataset and study the problem of "which block is heavier" on them. The human studies are conducted on Amazon Mechanical Turk. For each block tower, we provide 25 annotators with an image and a video of a physics interaction, and ask them to estimate the ratio of mass between the upper and the lower block. Instead of directly predicting a real value, we require the annotators to make a choice on a log scale. Results of average human predictions, model predictions, and the ground truth are shown in Figure 11. Our model performs comparably to humans, and its responses are also highly correlated with humans': the Pearson coefficients of "Human vs. Model", "Human vs. Truth", and "Model vs. Truth" are 0.69, 0.71, and 0.90, respectively.
7 Conclusion
In this paper, we have formulated and studied the problem of physical primitive decomposition (PPD): approximating an object with a set of primitives that explain both its geometry and its physics. To this end, we proposed a novel formulation that takes both visual and physics observations as input. We evaluated our model in several different setups: synthetic block towers, synthetic tools, and real-world objects. Our model achieved good performance on both synthetic and real data.
Acknowledgements: This work is supported by NSF #1231216, ONR MURI N000141612007, Toyota Research Institute, and Facebook.
References
 [1] Agrawal, P., Nair, A., Abbeel, P., Malik, J., Levine, S.: Learning to poke by poking: Experiential learning of intuitive physics. In: NIPS (2016)
 [2] Attene, M., Falcidieno, B., Spagnuolo, M.: Hierarchical mesh segmentation based on fitting primitives. The Visual Computer 22(3), 181–193 (2006)

 [3] Battaglia, P.W., Hamrick, J.B., Tenenbaum, J.B.: Simulation as an engine of physical scene understanding. PNAS 110(45), 18327–18332 (2013)
 [4] Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: CVPR (2015)
 [5] Biederman, I.: Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94(2), 115 (1987)
 [6] Binford, T.O.: Visual perception by computer. In: IEEE Conf. on Systems and Control (1971)
 [7] Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Physics-based person tracking using the anthropomorphic walker. IJCV 87(1–2), 140 (2010)
 [8] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv:1512.03012 (2015)
 [9] Coumans, E.: Bullet physics engine. Open Source Software: http://bulletphysics.org (2010)

 [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
 [11] Grabner, H., Gall, J., Van Gool, L.: What makes a chair a chair? In: CVPR (2011)
 [12] Gupta, A., Efros, A.A., Hebert, M.: Blocks world revisited: Image understanding using qualitative geometry and mechanics. In: ECCV (2010)
 [13] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
 [14] van den Hengel, A., Russell, C., Dick, A., Bastian, J., Pooley, D., Fleming, L., Agapito, L.: Part-based modelling of compound scenes from images. In: CVPR (2015)

 [15] Huang, H., Kalogerakis, E., Marlin, B.: Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. CGF 34(5), 25–38 (2015)
 [16] Huang, Q., Wang, H., Koltun, V.: Single-view reconstruction via joint analysis of image and shape collections. ACM TOG 34(4), 87 (2015)
 [17] Igarashi, T., Matsuoka, S., Tanaka, H.: Teddy: a sketching interface for 3d freeform design. In: SIGGRAPH (1999)

 [18] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
 [19] Gibson, J.J.: The theory of affordances. The Ecological Approach to Visual Perception (1977)
 [20] Jia, Z., Gallagher, A., Saxena, A., Chen, T.: 3d reasoning from blocks to stability. IEEE TPAMI 37(5), 905–918 (2015)
 [21] Kalogerakis, E., Chaudhuri, S., Koller, D., Koltun, V.: A probabilistic model for componentbased shape synthesis. ACM TOG 31(4), 55 (2012)
 [22] Kim, M., Pons-Moll, G., Pujades, S., Bang, S., Kim, J., Black, M.J., Lee, S.H.: Data-driven physics for human soft tissue animation. In: SIGGRAPH (2017)
 [23] Kim, V.G., Li, W., Mitra, N.J., Chaudhuri, S., DiVerdi, S., Funkhouser, T.: Learning part-based templates from large collections of 3D shapes. ACM TOG 32(4), 70 (2013)
 [24] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
 [25] Koppula, H.S., Saxena, A.: Physically grounded spatiotemporal object affordances. In: ECCV (2014)
 [26] Lerer, A., Gross, S., Fergus, R.: Learning physical intuition of block towers by example. In: ICML (2016)
 [27] Levenberg, K.: A method for the solution of certain nonlinear problems in least squares. Quarterly of applied mathematics 2(2), 164–168 (1944)

 [28] Li, J., Xu, K., Chaudhuri, S., Yumer, E., Zhang, H., Guibas, L.: GRASS: Generative recursive autoencoders for shape structures. In: SIGGRAPH (2017)
 [29] Li, W., Leonardis, A., Fritz, M.: Visual stability prediction for robotic manipulation. In: ICRA (2017)
[30] Li, Y., Wu, X., Chrysathou, Y., Sharf, A., Cohen-Or, D., Mitra, N.J.: GlobFit: Consistently fitting primitives by discovering global relations. ACM TOG 30(4), 52 (2011)
 [31] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML (2013)
[32] Marquardt, D.W.: An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics 11(2), 431–441 (1963)
 [33] Miller, A.T., Knoop, S., Christensen, H.I., Allen, P.K.: Automatic grasp planning using shape primitives. In: ICRA (2003)
[34] Mottaghi, R., Rastegari, M., Gupta, A., Farhadi, A.: “What happens if…” learning to predict the effect of forces in images. In: ECCV (2016)
 [35] Nooruddin, F.S., Turk, G.: Simplification and repair of polygonal models using volumetric techniques. IEEE TVCG 9(2), 191–205 (2003)
[36] Pham, T.H., Kheddar, A., Qammaz, A., Argyros, A.A.: Towards force sensing from vision: Observing hand-object interactions to infer manipulation forces. In: CVPR (2015)
 [37] Rivlin, E., Dickinson, S.J., Rosenfeld, A.: Recognition by functional parts. CVIU 62(2), 164–176 (1995)
[38] Roberts, L.G.: Machine perception of three-dimensional solids. Ph.D. thesis, Massachusetts Institute of Technology (1963)

[39] Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV 40(2), 99–121 (2000)
[40] Savva, M., Chang, A.X., Hanrahan, P.: Semantically-enriched 3D models for common-sense knowledge. In: CVPR Workshop (2015)
 [41] Schnabel, R., Degener, P., Klein, R.: Completion and reconstruction with primitive shapes. CGF 28(2), 503–512 (2009)
[42] Soo Park, H., Shi, J., et al.: Force from motion: decoding physical sensation in a first-person video. In: CVPR (2016)
 [43] Tenenbaum, J.B.: Functional parts. In: CogSci (1994)
 [44] Tulsiani, S., Su, H., Guibas, L.J., Efros, A.A., Malik, J.: Learning shape abstractions by assembling volumetric primitives. In: CVPR (2017)
[45] van den Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: ICML (2016)
[46] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. MLJ 8(3–4), 229–256 (1992)
 [47] Wu, J., Lim, J.J., Zhang, H., Tenenbaum, J.B., Freeman, W.T.: Physics 101: Learning physical object properties from unlabeled videos. In: BMVC (2016)
[48] Wu, J., Lu, E., Kohli, P., Freeman, W.T., Tenenbaum, J.B.: Learning to see physics via visual de-animation. In: NIPS (2017)
 [49] Wu, J., Yildirim, I., Lim, J.J., Freeman, W.T., Tenenbaum, J.B.: Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In: NIPS (2015)
[50] Yao, B., Ma, J., Fei-Fei, L.: Discovering object functionality. In: ICCV (2013)
[51] Yumer, M.E., Kara, L.B.: Co-abstraction of shape collections. ACM TOG 31(6), 166 (2012)
 [52] Zhao, Y., Zhu, S.C.: Scene parsing by integrating function, geometry and appearance models. In: CVPR (2013)

[53] Zheng, D., Luo, V., Wu, J., Tenenbaum, J.B.: Unsupervised learning of latent physical properties using perception-prediction networks. In: UAI (2018)
[54] Zheng, Y., Cohen-Or, D., Averkiou, M., Mitra, N.J.: Recurring part arrangements in shape collections. CGF 33(2), 115–124 (2014)
 [55] Zhu, Y., Zhao, Y., Zhu, S.C.: Understanding tools: Taskoriented object modeling, learning and recognition. In: CVPR (2015)
 [56] Zhu, Y., Fathi, A., FeiFei, L.: Reasoning about object affordances in a knowledge base representation. In: ECCV (2014)
 [57] Zou, C., Yumer, E., Yang, J., Ceylan, D., Hoiem, D.: 3dprnn: Generating shape primitives with recurrent neural networks. In: ICCV (2017)
Appendix A.1 Implementation Details
We present implementation details of the network architecture and the training procedure.
3D ConvNet. As the building block of the voxel encoder, this network consists of five volumetric convolutional layers, with kernel sizes of 3×3×3 and padding sizes of 1. Between convolutional layers, we add batch normalization [18], Leaky ReLU [31] with a slope of 0.2, and max-pooling of size 2×2×2. At the end of the network, we append two additional 1×1×1 volumetric convolutional layers.

Network Details. The voxels, images, and trajectories fed into the corresponding encoders are of size 1×32×32×32, 3×224×224, and 256×7, respectively. The output features of all encoders are 64-dimensional. Inside both the trajectory encoder and the primitive generator, we employ Long Short-Term Memory (LSTM) cells with hidden sizes of 64 and dropout rates of 0.5 as the recurrent units. The trajectory encoder uses a single-layer recurrent neural network, while the primitive generator applies three layers of recurrently connected units.
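The two encoders described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the per-layer channel counts (8, 16, 32, 64, 128) are assumptions, since the exact values are not recoverable from this text, and LSTM dropout is omitted here because PyTorch applies it only between stacked layers.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3x3 conv (padding 1) -> batch norm -> Leaky ReLU (0.2) -> 2x2x2 max-pool
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.LeakyReLU(0.2),
        nn.MaxPool3d(2),
    )

class VoxelEncoder(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        chans = [1, 8, 16, 32, 64, 128]  # assumed channel progression
        self.convs = nn.Sequential(*[conv_block(chans[i], chans[i + 1])
                                     for i in range(5)])
        # two trailing 1x1x1 volumetric convolutions
        self.head = nn.Sequential(nn.Conv3d(chans[-1], feat_dim, 1),
                                  nn.Conv3d(feat_dim, feat_dim, 1))

    def forward(self, x):                 # x: (B, 1, 32, 32, 32)
        # five 2x2x2 poolings shrink 32^3 down to 1^3
        return self.head(self.convs(x)).flatten(1)   # (B, 64)

class TrajectoryEncoder(nn.Module):
    # single-layer LSTM with hidden size 64, reading a 256-step
    # trajectory of 7-D states (the 256x7 input quoted above)
    def __init__(self, in_dim=7, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=1, batch_first=True)

    def forward(self, traj):              # traj: (B, 256, 7)
        _, (h, _) = self.lstm(traj)       # final hidden state summarizes the sequence
        return h[-1]                      # (B, 64)

f_v = VoxelEncoder()(torch.zeros(2, 1, 32, 32, 32))
f_t = TrajectoryEncoder()(torch.zeros(2, 256, 7))
```

Both sketches map their respective inputs to the 64-dimensional feature vectors used downstream.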
Training Details. We implement our PPD model in PyTorch (http://pytorch.org). For the image encoder, we use the weights of a ResNet-18 [13] pretrained on ImageNet [10], replacing its final classification layer with a fully-connected layer; the weights of all other modules are initialized randomly. During optimization, we first train the geometric parameters (with the weight of the physics term set to 0), and then train all parameters jointly. Optimization is carried out using Adam [24] with a mini-batch size of 8.
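The two-stage schedule above can be sketched as follows. The modules and loss terms here are toy placeholders standing in for the geometric and physical branches of the model, not the paper's actual networks.

```python
import torch
import torch.nn as nn

def run_stage(params, batches, loss_fn, lr=1e-3):
    # one optimization stage over a fixed set of parameters with Adam
    opt = torch.optim.Adam(params, lr=lr)
    for x, y in batches:
        opt.zero_grad()
        loss_fn(x, y).backward()
        opt.step()

geom = nn.Linear(4, 2)   # stand-in for the geometric branch
phys = nn.Linear(2, 1)   # stand-in for the physical branch

# mini-batches of size 8, matching the batch size quoted above
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]

# Stage 1: geometric parameters only (physics term weighted to zero).
run_stage(geom.parameters(), batches,
          lambda x, y: geom(x).pow(2).mean())
# Stage 2: all parameters jointly, with the full loss.
run_stage(list(geom.parameters()) + list(phys.parameters()), batches,
          lambda x, y: (phys(geom(x)) - y).pow(2).mean())
```

Freezing is implicit here: stage 1 simply passes only the geometric parameters to the optimizer, so the physical branch is untouched until stage 2.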