Humans use a hammer by holding its handle and striking its head, not vice versa. In this simple action, people demonstrate their understanding of functional parts [37, 43]: a tool, or any object, can be decomposed into primitive-based components, each with distinct physics, functionality, and affordances [19].
How to build a machine of such competency? In this paper, we tackle the problem of physical primitive decomposition (PPD)—explaining the shape and the physics of an object with a few shape primitives with physical parameters. Given the hammer in Figure 1, our goal is to build a model that recovers its two major components: a tall, wooden cylinder for its handle, and a smaller, metal cylinder for its head.
For this task, we need a physical, part-based object shape representation that models both object geometry and physics. Ground-truth annotations for such representations are, however, challenging to obtain: large-scale shape repositories like ShapeNet [8] often have limited annotations on object parts, let alone physics. This is mostly due to two reasons. First, annotating object parts and physics is labor-intensive and requires strong domain expertise, neither of which can be offered by current crowdsourcing platforms. Second, there exists intrinsic ambiguity in the ground truth: it is impossible to precisely label underlying physical object properties like densities from only images or videos.
Let’s think more about what these representations are for. We want our object representation to faithfully encode its geometry; therefore, it should be able to explain our visual observation of the object’s appearance. Further, as the representation models object physics, it should be effective in explaining the object’s behaviors in various physical events.
Inspired by this, we propose a novel formulation that learns a part-based object representation from both visual observations and physical interactions. Starting with a single image and a voxelized shape, the model recovers the geometric primitives and infers their physical properties from texture. The physical representation inferred this way is of course rather uncertain; it therefore only serves as the model’s prior of this physical shape. Observing object behaviors in physical events offers crucial additional information, as objects with different physical properties behave differently in physical events. This is used by the model in conjunction with the prior to produce its final prediction.
We evaluate our system for physical primitive decomposition in three scenarios. First, we generate a dataset of synthetic block towers, where each block has distinct geometry and physics. Our model is able to successfully reconstruct the physical primitives by making use of both appearance and motion cues. Second, we evaluate the system on a set of synthetic tools, demonstrating its applicability to daily-life shapes. Third, we build a new dataset of real block towers in dynamic scenes, and evaluate the model’s generalization power to real videos.
We further present ablation studies to understand how each source of information contributes to the final performance. We also conduct human behavioral experiments to contrast the performance of the model with humans. In a ‘which block is heavier’ experiment, our model performs comparably to humans.
Our contributions in this paper are three-fold. First, we propose the problem of physical primitive decomposition—learning a compact, disentangled object representation in terms of physical primitives. Second, we present a novel learning paradigm that learns to characterize shapes in physical primitives to explain both their geometry and physics. Third, we demonstrate that our system can achieve good performance on both synthetic and real data.
2 Related Work
The idea of modeling shapes with primitives has been constantly revisited throughout the development of computer vision [12, 14, 2]. To name a few, Gupta et al. [12] modeled scenes as qualitative blocks, and van den Hengel et al. [14] as Lego blocks. More recently, Tulsiani et al. [44] combined the new and the old, using a deep convolutional network to generate primitives of a given 3D shape; later, Zou et al. proposed 3D-PRNN [57], enhancing the flexibility of the system by leveraging modern advances in recurrent generative models [45].
Primitive-based representations have had a profound impact that goes far beyond the field of computer vision. Scientists have employed this representation for user-interactive design [17] and for teaching robots to grasp objects [33]. In the field of computer graphics, the idea of modeling shapes as primitives or parts has also been extensively explored [54, 51, 30, 21, 23, 2]. Researchers have used the part-based representation for single-image shape reconstruction [16], shape completion [41], and probabilistic shape synthesis [15, 28].
Physical Shape and Scene Modeling. Beyond object geometry, there has been growing interest in modeling physical object properties and scene dynamics. The computer vision community has put major effort into building rich and sizable databases. ShapeNet-Sem [40] is a collection of object shapes with material and physics annotations within the web-scale shape repository ShapeNet [8]. Material in Context Database (MINC) [4] is a gigantic dataset of materials in the wild, associating patches in real-world images with 23 materials.
Research on physical object modeling dates back to the study of “functional parts” [37, 43, 19]. The field of learning object physics and scene dynamics has prospered in the past few years [26, 1, 20, 3, 52, 34, 36, 7, 42, 22, 29]. Among them, there are a few papers that explicitly build physical object representations [34, 47, 49, 48, 53]. Though they also focus on understanding object physics [47, 49], functionality [55, 50], and affordances [25, 11, 56], these approaches usually assume a homogeneous object with simple geometry. In our paper, we model an object using physical primitives for richer expressiveness and higher precision.
3 Physical Primitive Decomposition
3.1 Problem Statement
Both primitive decomposition and physical primitive decomposition attempt to approximate an object with primitives. We highlight their difference in Figure 2.
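Primitive Decomposition. As formulated in Tulsiani et al. [44] and Zou et al. [57], primitive decomposition aims to decompose an object $O$ into a set of simple transformed primitives $\mathcal{P} = \{P_k\}$ so that these primitives can accurately approximate its geometry shape. This task can be seen as minimizing

$$\mathcal{L}_{\text{G}}(\mathcal{P}) = \mathcal{D}_{\text{S}}\Big(\mathcal{S}\big(\cup_k P_k\big),\ \mathcal{S}(O)\Big), \tag{1}$$

where $\mathcal{S}(\cdot)$ denotes the geometry shape (i.e. point cloud), and $\mathcal{D}_{\text{S}}(\cdot, \cdot)$ denotes the distance metric between shapes (i.e. earth-mover's distance [39]).
Physical Primitive Decomposition. In order to understand the functionality of object parts, we require the decomposed primitives $\mathcal{P}$ to also approximate the physical behavior of object $O$. To this end, we extend the previous objective function with an additional physics term:

$$\mathcal{L}_{\text{P}}(\mathcal{P}) = \sum_{p \in \mathbb{P}} \mathcal{D}_{\text{T}}\Big(\mathcal{T}_p\big(\cup_k P_k\big),\ \mathcal{T}_p(O)\Big), \tag{2}$$

where $\mathcal{T}_p(\cdot)$ denotes the trajectory after physics interaction $p$, $\mathcal{D}_{\text{T}}(\cdot, \cdot)$ denotes the distance metric between trajectories (i.e. mean squared error), and $\mathbb{P}$ denotes a predefined set of physics interactions. Therefore, the task of physical primitive decomposition is to minimize an overall objective function constraining both geometry and physics: $\mathcal{L}(\mathcal{P}) = \mathcal{L}_{\text{G}}(\mathcal{P}) + w \cdot \mathcal{L}_{\text{P}}(\mathcal{P})$, where $w$ is a weighting factor.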
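As a rough illustration of how the combined objective could be evaluated for one candidate decomposition, consider the following Python sketch. It rests on several assumptions that are not part of the formulation above: a symmetric Chamfer distance (on squared distances) stands in for the earth-mover's distance, shapes are small lists of 3D points, trajectories are equally long lists of 3D positions, and the function names and the default weight `w` are hypothetical.

```python
def chamfer(a, b):
    """Symmetric Chamfer distance (squared distances); a stand-in for EMD."""
    def sq(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # Average squared distance from each source point to its nearest target.
        return sum(min(sq(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def trajectory_mse(t1, t2):
    """Mean squared error between two equally long trajectories of 3D positions."""
    total = sum((x - y) ** 2 for p, q in zip(t1, t2) for x, y in zip(p, q))
    return total / (len(t1) * len(t1[0]))


def ppd_objective(pred_points, true_points, pred_trajs, true_trajs, w=1.0):
    """Geometry term plus w times the physics term summed over interactions."""
    geometry = chamfer(pred_points, true_points)
    physics = sum(trajectory_mse(p, t) for p, t in zip(pred_trajs, true_trajs))
    return geometry + w * physics
```

A decomposition that matches the target in both shape and motion scores zero; a mismatch in either term raises the objective, with `w` controlling how strongly the physics term counts.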
[Figure 2: primitive decomposition versus physical primitive decomposition. Panel labels: Iron and Wood; Two Coppers.]
3.2 Primitive-Based Representation
We design a structured primitive-based object representation, which describes an object by listing all of its primitives with different attributes. For each primitive $P$, we record its size $P^s = (s_x, s_y, s_z)$, position in 3D space $P^p = (p_x, p_y, p_z)$, and rotation in quaternion form $P^q = (q_w, q_x, q_y, q_z)$. Apart from this geometric information, we also track its physical property: density $P^\rho$.
In our object representation, the shape parameters, $P^s$, $P^p$, and $P^q$, are vectors of continuous real values, whereas the density parameter $P^\rho$ is a discrete value. We discretize the density values into $N_D$ slots, so that estimating density becomes an $N_D$-way classification. Discretization helps to deal with multi-modal density values. Figure 2(a) shows that two parts with similar visual appearance may have very different physical parameters. In such cases, regression with an $L_2$ loss will encourage the model to predict the mean value of possible densities; in contrast, discretization allows it to give high probabilities to every possible density. We then determine which candidate value is optimal from the trajectories.
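The argument for discretization can be made concrete with a small Python sketch. The density range, the number of slots, and the two example densities below are hypothetical values chosen only for illustration; the helper names are not from the paper.

```python
def mean_prediction(candidates):
    """The constant minimizing squared error over equally likely targets is their mean."""
    return sum(candidates) / len(candidates)


def density_to_bin(density, lo, hi, num_bins):
    """Map a density in [lo, hi] to one of num_bins equal-width slots."""
    idx = int((density - lo) / (hi - lo) * num_bins)
    return min(max(idx, 0), num_bins - 1)


def bin_to_density(idx, lo, hi, num_bins):
    """Return the density at the center of a slot."""
    return lo + (idx + 0.5) * (hi - lo) / num_bins


# Two visually similar parts whose true densities are 7.5 and 75.0 (made-up units).
modes = [7.5, 75.0]
regressed = mean_prediction(modes)  # lies between the modes and matches neither
slots = [density_to_bin(d, 0.0, 100.0, 100) for d in modes]  # two distinct classes
```

A regressor trained with a squared-error loss is pulled toward `regressed`, whereas a classifier can assign high probability to both entries of `slots` and defer the final choice to the trajectory evidence.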
In this section, we discuss our approach to the problem of physical primitive decomposition (PPD). We present an overview of our framework in Figure 4.
Inferring physical parameters from solely visual or physical observation is highly challenging. This is because two objects with different physical parameters might have similar visual appearance (Figure 2(a)) or have similar physics trajectories (Figure 2(b)). Therefore, our model takes both types of observations as input:
Visual Observation. We take a voxelized shape and an image as our input because they can provide us with valuable visual information. Voxels help us recover object geometry, and images contain texture information of object materials. Note that, even with voxels as input, it is still highly nontrivial to infer geometric parameters: the model needs to learn to segment 3D parts within the object, which is an unsolved problem by itself.
Physics Observation. In order to explain the physical behavior of an object, we also need to observe its response after some physics interactions. In this work, we choose to use 3D object trajectories rather than RGB (or RGB-D) videos. Their abstractness enables the model to transfer better from synthetic to real data, because synthetic and real videos can be starkly different; in contrast, it is easy to generate synthetic 3D trajectories that look realistic.
Specifically, our network takes a voxel $V$, an image $I$, and object trajectories $T$ as input. $V$ is a 3D binary voxelized grid, $I$ is a single RGB image, and $T$ consists of several object trajectories $T_k$, each of which records the response to one specific physics interaction. Trajectory $T_k$ is a sequence of 3D object poses $(p_x, p_y, p_z, q_w, q_x, q_y, q_z)$, where $(p_x, p_y, p_z)$ denotes the object's center position and quaternion $(q_w, q_x, q_y, q_z)$ denotes its rotation at each time step.
After receiving the inputs, our network encodes voxel, image and trajectory with separate encoders, and sequentially predicts primitives using a recurrent primitive generator. For each primitive, the network predicts its geometry shape (i.e. scale, translation and rotation) and physical property (i.e. density). More details of our model can be found in the supplementary material.
Voxel Encoder. For input voxel $V$, we employ a 3D volumetric convolutional network to encode the 3D shape information into a voxel feature $f_V$.
Trajectory Encoder. For input trajectories $T$, we encode each trajectory $T_k$ into a low-dimensional feature vector $f_{T_k}$ with a separate bi-directional recurrent neural network. Specifically, we feed the trajectory sequence, $T_k$, and also the same trajectory sequence in reverse order, $T_k^{\text{reverse}}$, into two encoding RNNs, to obtain two final hidden states: $h_k^{\rightarrow}$ and $h_k^{\leftarrow}$. We take $[h_k^{\rightarrow}; h_k^{\leftarrow}]$ as the feature vector $f_{T_k}$. Finally, we concatenate the features of each trajectory, $\{f_{T_k}\}$, and project them into a low-dimensional trajectory feature $f_T$ with a fully-connected layer.
Primitive Generator. We concatenate the voxel feature $f_V$, image feature $f_I$ and trajectory feature $f_T$ together as $\hat{f} = [f_V; f_I; f_T]$, and map it to a low-dimensional feature $f$ using a fully-connected layer. We predict the set of physical primitives sequentially by a recurrent generator.
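At each time step $k$, we feed the previously generated primitive $P_{k-1}$ and the feature vector $f$ in as input, and we receive one hidden vector $g_k$ as output. Then, we compute the new primitive $P_k = (P_k^s, P_k^p, P_k^q, P_k^\rho)$ as

$$
\begin{aligned}
P_k^s &= C_s \cdot \mathrm{sigmoid}(W_s g_k + b_s), & P_k^p &= C_p \cdot \tanh(W_p g_k + b_p), \\
P_k^q &= \frac{W_q g_k + b_q}{\max(\lVert W_q g_k + b_q \rVert_2,\ \epsilon)}, & P_k^\rho &= \mathrm{softmax}(W_\rho g_k + b_\rho),
\end{aligned} \tag{3}
$$

where $C_s$ and $C_p$ are scaling factors, and $\epsilon$ is a small constant for numerical stability. Equation 3 guarantees that $P_k^s$ is in the range of $[0, C_s]$, $P_k^p$ is in the range of $[-C_p, C_p]$, and $\lVert P_k^q \rVert_2$ is 1 (if ignoring $\epsilon$), which ensures that $P_k$ will always be a valid primitive. In our experiments, we set $C_s = C_p = 1$, since we normalize all objects so that they can fit in unit cubes. Also note that $P_k^\rho$ is an $(N_D + 2)$-dimensional vector, where the first $N_D$ dimensions indicate different density values and the last two indicate the "start token" and "end token".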
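The range guarantees described for Equation 3 can be illustrated with a small Python sketch of an output head. The particular squashing functions (sigmoid for size, tanh for position, norm division for the quaternion, softmax for the density scores) are assumptions chosen to satisfy the stated ranges; the inputs stand for already-projected hidden values, and all names are hypothetical.

```python
import math


def primitive_from_raw(raw_size, raw_pos, raw_quat, raw_density,
                       c_s=1.0, c_p=1.0, eps=1e-8):
    """Squash unconstrained network outputs into a valid primitive."""
    # Sigmoid written via tanh for numerical stability: each size lies in (0, c_s).
    size = [c_s * 0.5 * (1.0 + math.tanh(v / 2.0)) for v in raw_size]
    # tanh keeps each position coordinate inside (-c_p, c_p).
    pos = [c_p * math.tanh(v) for v in raw_pos]
    # Divide by the norm (floored at eps) so the quaternion has unit length.
    norm = math.sqrt(sum(v * v for v in raw_quat))
    quat = [v / max(norm, eps) for v in raw_quat]
    # Softmax over density slots (plus any special tokens), shifted for stability.
    m = max(raw_density)
    exps = [math.exp(v - m) for v in raw_density]
    z = sum(exps)
    density = [e / z for e in exps]
    return size, pos, quat, density
```

Whatever values the recurrent unit emits, the resulting size, position and rotation are geometrically valid, and the density output is a proper probability vector.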
Sampling and Simulating with the Physics Engine. During testing time, we treat the predicted as a multinomial distribution, and we sample multiple possible predictions from it. For each sample, we use its physical parameters to simulate the trajectory with a physics engine. Finally, we select the one whose simulated trajectory is closest to the observed trajectory.
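The test-time procedure can be sketched in a few lines of Python. This is only an illustration: `toy_simulate` is a hypothetical one-dimensional stand-in for a physics engine (unit-volume parts pushed by a known force, no friction), and the sample count, seed, and function names are assumptions rather than details of our system.

```python
import random


def toy_simulate(densities, force=10.0, steps=5, dt=0.1):
    """Hypothetical 1D simulator: positions of a body pushed by a constant force."""
    mass = sum(densities)  # unit volume per part
    return [0.5 * (force / mass) * (k * dt) ** 2 for k in range(1, steps + 1)]


def sample_and_select(density_probs, density_values, simulate, observed,
                      num_samples=20, seed=0):
    """Sample density assignments and keep the one whose simulation fits best."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(num_samples):
        # Draw one density per primitive from its predicted categorical distribution.
        cand = [rng.choices(density_values, weights=p, k=1)[0] for p in density_probs]
        sim = simulate(cand)
        # Mean squared error between simulated and observed trajectories.
        err = sum((a - b) ** 2 for a, b in zip(sim, observed)) / len(observed)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err
```

The predicted distribution acts as a prior that restricts sampling to plausible densities, and the simulator breaks the remaining ties using the observed motion.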
An alternative way to incorporate the physics engine is to directly optimize our model over it. As most physics engines are not differentiable, we employ REINFORCE [46] for optimization. Empirically, we observe that this reinforcement learning based method performs worse than sampling-based methods, possibly due to the large variance of the approximate gradient signals.
Simulating with a physics engine requires that we know the force during testing. Such an assumption is essential to ensure the problem is well-posed: without knowing the force, we can only infer the relative part density, but not the actual values. Note that in many real-world applications such as robot manipulation, the external force is indeed available.
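To make this ambiguity concrete, consider a deliberately simplified case (an illustrative assumption, not our simulation setup): a rigid object made of parts with volumes $V_k$ and densities $\rho_k$, pushed by a force $F$ through its center of mass, with friction ignored. Its acceleration is

$$a = \frac{F}{\sum_k \rho_k V_k} = \frac{\alpha F}{\sum_k (\alpha \rho_k) V_k} \quad \text{for any } \alpha > 0,$$

so scaling the force and all densities by the same factor $\alpha$ produces exactly the same motion. The observed trajectory therefore constrains only the ratios $\rho_k / F$; fixing $F$ removes the free scale $\alpha$ and makes the absolute densities recoverable.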
4.2 Loss Functions
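Let $\mathcal{P} = (P_1, P_2, \ldots, P_n)$ and $\hat{\mathcal{P}} = (\hat{P}_1, \hat{P}_2, \ldots, \hat{P}_n)$ be the predicted and ground-truth physical primitives, respectively. Our loss function consists of two terms, geometry loss $\mathcal{L}_{\text{G}}$ and physics loss $\mathcal{L}_{\text{P}}$:

$$\mathcal{L}_{\text{G}}(\mathcal{P}, \hat{\mathcal{P}}) = \sum_k \Big( \omega_s \cdot \lVert P_k^s - \hat{P}_k^s \rVert + \omega_p \cdot \lVert P_k^p - \hat{P}_k^p \rVert + \omega_q \cdot \lVert P_k^q - \hat{P}_k^q \rVert \Big), \tag{4}$$

$$\mathcal{L}_{\text{P}}(\mathcal{P}, \hat{\mathcal{P}}) = -\sum_k \sum_i \hat{P}_k^\rho(i) \cdot \log P_k^\rho(i), \tag{5}$$

where $\omega_s$, $\omega_p$ and $\omega_q$ are weighting factors, which are set to 1's because $P^s$, $P^p$ and $P^q$ are of the same magnitude in our datasets. Integrating Equation 4 and Equation 5, we define the overall loss function as $\mathcal{L}(\mathcal{P}, \hat{\mathcal{P}}) = \mathcal{L}_{\text{G}}(\mathcal{P}, \hat{\mathcal{P}}) + w \cdot \mathcal{L}_{\text{P}}(\mathcal{P}, \hat{\mathcal{P}})$, where $w$ is set to ensure that $\mathcal{L}_{\text{G}}$ and $\mathcal{L}_{\text{P}}$ are of the same magnitude.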
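The structure of this loss can be sketched in plain Python as follows. The sketch makes two assumptions that go beyond the text: an L1 distance for the geometric attributes and a cross-entropy over density slots for the physics term. The dictionary layout and all function names are hypothetical.

```python
import math


def l1(a, b):
    """L1 distance between two equally long vectors (assumed norm)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def geometry_loss(pred, target, w_s=1.0, w_p=1.0, w_q=1.0):
    """Weighted sum of size, position and rotation errors over matched primitives."""
    return sum(w_s * l1(p["size"], t["size"])
               + w_p * l1(p["pos"], t["pos"])
               + w_q * l1(p["quat"], t["quat"])
               for p, t in zip(pred, target))


def physics_loss(pred, target, eps=1e-12):
    """Cross-entropy of the predicted density distribution at the true slot."""
    return -sum(math.log(max(p["density_probs"][t["density_bin"]], eps))
                for p, t in zip(pred, target))


def total_loss(pred, target, w=1.0):
    """Geometry loss plus w times physics loss."""
    return geometry_loss(pred, target) + w * physics_loss(pred, target)
```

Because primitives are compared position by position in the two lists, this sketch relies on predicted and ground-truth primitives sharing a common order.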
Part Associations. In our formulation, object parts (physical primitives) follow a pre-defined order (e.g., from bottom to top), and our model is encouraged to learn to predict the primitives in the same order.
We evaluate our PPD model in three diverse settings: synthetic block towers where blocks are of various materials and shapes; synthetic tools with more complex geometry shapes; and real videos of block towers to demonstrate our transferability to real-world scenarios.
5.1 Decomposing Block Towers
We start with decomposing block towers (stacks of blocks).
Block Towers. We build the block towers by stacking variable number of blocks (2-5 in our experiments) together. We first sample the size of each block and then compute the center position of blocks from bottom to top. For the th block, we denote the size as , and its center is sampled and computed by , , and , where
is a normal distribution with mean. We illustrate some constructed block towers in Figure 5. We perform exact voxelization with a grid size of 32×32×32 using binvox, a 3D mesh voxelizer [35].
Materials. In our experiments, we use five different materials, and follow their real-world densities with minor modifications. The materials and the ranges of their densities are listed in Table 1. For each block in the block towers, we first assign it to one of the five materials, and then uniformly sample its density from possible values of its material. We generate 8 configurations for each block tower.
Textures. We obtain the textures for materials by cropping the center portion of images from the MINC dataset [4]. We show sample images rendered with material textures in Figure 5. Since we render the textures only with respect to the material, the rendered images do not provide any information about density.
Physics Interactions. We place the block towers at the origin and perform four physics interactions to obtain the object trajectories (). In detail, we exert a force with the magnitude of on the block tower from four pre-defined positions . We simulate each physics interaction for 256 time steps using the Bullet Physics Engine . To ensure simulation accuracy, we set the time step for simulation to s.
|Methods|Texture|Physics|Top 1|Top 5|Top 10|RMSE|MAE|
|PPD (no trajectory)|+|–|7.2|35.2|69.5|19.0|51.7|
|PPD (no image)|–|+|7.1|31.0|50.8|16.7|36.4|
|PPD (no voxels)|+|+|15.9|56.3|82.4|10.3|29.9|
Metrics. We evaluate the performance of shape reconstruction by the F1 score between the prediction and ground truth: each primitive in the prediction is labeled as a true positive if its intersection over union (IoU) with a ground-truth primitive is greater than 0.5. For physics estimation, we employ two types of metrics: i) density measures: top-$k$ accuracy ($k \in \{1, 5, 10\}$) and root-mean-square error (RMSE), and ii) a trajectory measure: mean absolute error (MAE) between the simulated trajectory (using the predicted physical parameters) and the ground-truth trajectory.
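For reference, these metrics could be computed along the following lines. The sketch simplifies the setting by assuming axis-aligned boxes given as (min corner, max corner), whereas primitives in general carry a rotation; the function names are hypothetical.

```python
import math


def box_volume(box):
    """Volume of an axis-aligned box given as (min_corner, max_corner)."""
    return math.prod(hi - lo for lo, hi in zip(box[0], box[1]))


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1]):
        inter *= max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    return inter / (box_volume(a) + box_volume(b) - inter)


def f1_score(pred_boxes, true_boxes, thresh=0.5):
    """F1 where a box counts as matched if some counterpart has IoU > thresh."""
    hit_pred = sum(any(box_iou(p, t) > thresh for t in true_boxes) for p in pred_boxes)
    hit_true = sum(any(box_iou(p, t) > thresh for p in pred_boxes) for t in true_boxes)
    if hit_pred == 0 or hit_true == 0:
        return 0.0
    precision = hit_pred / len(pred_boxes)
    recall = hit_true / len(true_boxes)
    return 2.0 * precision * recall / (precision + recall)


def top_k_accuracy(prob_rows, true_bins, k):
    """Fraction of rows whose true slot is among the k most probable slots."""
    hits = 0
    for row, true_bin in zip(prob_rows, true_bins):
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += true_bin in top
    return hits / len(true_bins)


def rmse(pred, true):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))
```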
Methods. We evaluate our model with different combinations of observations as input: i) texture only (i.e., no trajectory, by setting the trajectory feature to zero), ii) physics only (i.e., no image, by setting the image feature to zero), iii) both texture and physics but without the voxelized shape, iv) both texture and physics but replacing the 3D trajectory with a raw depth video, v) full data in our original setup (image, voxels, and trajectory). We also compare our model with several baselines: i) predicting the most frequent density in the training set (Frequent), ii) nearest neighbor retrieval from the training set (Nearest), and iii) knowing the ground-truth material and guessing within its density value range (Oracle). While all these baselines assume perfect shape reconstruction, our model learns to decompose the shape.
Results. For the shape reconstruction, our model achieves an F1 score of 97.5. For the physics estimation, we present quantitative results of our model with different observations as input in Table 2. We compare our model with an oracle that infers material properties from appearance while assuming ground-truth reconstruction. It gives the upper-bound performance of methods that rely on only appearance cues. Experiments suggest that appearance alone is not sufficient for density estimation. From Table 2, we observe that combining appearance with physics performs well on physical parameter estimation, because the object trajectories can provide crucial additional information about the density distribution (i.e. moment of inertia). Also, all input modalities and sampling contribute to the model's final performance.
We have also implemented a physics engine–based sampling baseline: sampling the shape and physical parameters for each primitive, using a physics engine for simulation, and selecting the one whose trajectory is closest to the observation. We also compare with a stronger baseline where we only sample physics, assuming ground-truth shape is known. Table 3 shows our model works better and is more efficient: the neural nets have learned an informative prior that greatly reduces the need of sampling at test time.
5.2 Decomposing Tools
We then demonstrate the practical applicability of our model by decomposing synthetic models of real-world tools.
Tools. Because of the absence of tool data in the ShapeNet Core [8] dataset, we download the tools from 3D Warehouse (https://3dwarehouse.sketchup.com) and manually remove all unrelated models. In total, there are 204 valid tools, and we use Blender to remesh and clean up these tools to fix the issues with missing faces and normals. Following Chang et al. [8], we perform PCA on the point clouds and align models by their PCA axes. Sample tools in our dataset are shown in Figure 6.
|Methods|Texture|Physics|Top 1|Top 5|Top 10|RMSE|MAE|
|PPD (no trajectory)|+|–|7.7|36.4|71.1|16.8|206.8|
|PPD (no image)|–|+|15.0|56.3|80.2|5.9|143.6|
Primitives. Similar to Zou et al. [57], we first use energy-based optimization to fit the primitives to the point clouds; we then assign each vertex to its nearest primitive and refine each primitive with the minimum oriented bounding box of the vertices assigned to it.
Other Setups. We make use of the same set of materials and densities as in Table 1 and the same textures for materials as described in Section 5.1. Sample images rendered with textures are shown in Figure 6. As for physics interactions, we follow the same scenario configurations as in Section 5.1.
Training Details. Because the size of the synthetic tools dataset is rather limited, we first pre-train our PPD model on the block towers and then finetune it on the synthetic tools. For the block towers used for pre-training, we fix the number of blocks to 2 and introduce small random noise and rotations to each block to narrow the gap between block towers and synthetic tools.
Results. For the shape reconstruction, our model achieves 85.9 in terms of F1 score. For the physics estimation, we present quantitative results in Table 4. The shape reconstruction is not as good as that of the block towers dataset because the synthetic tools are more complicated, and the orientations might introduce some ambiguity (there might exist multiple bounding boxes with different rotations for the same part of object). The physics estimation performance is better since the number of primitives in our synthetic tools dataset is very small (2 in general). We also show some qualitative results in Figure 6.
5.3 Decomposing Real Objects
We look into real objects to evaluate the generalization ability of our model.
Real-World Block Towers. We purchase ten sets of blocks in total, made of different materials (i.e. pine, steel, aluminum and copper), from Amazon, and construct a dataset of real-world block towers. Our dataset contains 16 block towers with different configurations: 8 with two blocks, 4 with three blocks, and another 4 with four blocks.
Physics Interaction. The scenario is set up as follows: the block tower is placed at a specific position on the desk, and we use a copper ball (hung as a pendulum) to hit it. In Figure 7, we show some objects and their trajectories in our dataset.
Video to 3D Trajectory. On real-world data, the appearance of every frame in RGB video is used to extract a 3D trajectory. A major challenge is how to convert RGB videos into 3D trajectories. We employ the following approach:
Tracking 2D Keypoints. For each frame, we first detect the 2D positions of object corners. For simplicity, we mark the object corners using red stickers and use a simple color filter to determine the corner positions. Then, we find the correspondence between the corner points from consecutive frames by solving the minimum-distance matching between two sets of points. After aligning the corner points in different frames, we obtain the 2D trajectories of these keypoints.
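The correspondence step can be illustrated with a brute-force matcher over all assignments. This is a sketch under assumptions: both frames contain the same number of detected corners, points are 2D pixel coordinates, the cost is the sum of squared distances, and the function name is hypothetical. Exhaustive search is only practical for a handful of corners; an assignment solver such as the Hungarian algorithm would be the usual choice for larger sets.

```python
from itertools import permutations


def match_corners(prev_pts, curr_pts):
    """Return order such that curr_pts[order[i]] corresponds to prev_pts[i]."""
    def cost(order):
        # Total squared displacement implied by this assignment.
        return sum((px - curr_pts[j][0]) ** 2 + (py - curr_pts[j][1]) ** 2
                   for (px, py), j in zip(prev_pts, order))

    return min(permutations(range(len(curr_pts))), key=cost)


prev_frame = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
curr_frame = [(10.5, 0.2), (0.1, 9.8), (0.3, -0.1)]
order = match_corners(prev_frame, curr_frame)  # (2, 0, 1)
```

Applying the returned order frame after frame chains the detections into one 2D trajectory per corner.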
Reconstructing 3D Poses. We annotate the 3D position for each corner point. Then, for each frame, we have 2D locations of keypoints and their corresponding 3D locations. Finally, we reconstruct the 3D object pose in each frame by solving the Perspective-n-Point problem between 2D and 3D locations using the Levenberg-Marquardt algorithm [27, 32].
Training Details. We build a virtual physics environment, similar to our real-world setup, in the Bullet Physics Engine [9]. We employ it to simulate physics interactions and generate a dataset of synthetic block towers to train our model.
Results. We show some qualitative results of our model with different observations as input in Figure 8. In the real-world setup, with only texture or physics information, our model cannot effectively predict the physical parameters because images and object trajectories are much noisier than those in synthetic dataset, while combining them together indeed helps it to predict much more accurate results. In terms of quantitative evaluation, our model (with both observations as input) achieves an RMSE value of 18.7 over the whole dataset and 10.1 over the block towers with two blocks (the RMSE value of random guessing is 40.8).
To better understand our model, we present several analyses. The first three are conducted on synthetic block towers and the last one is on our real dataset.
Learning Speed with Different Supervisions. We show the learning curves of our PPD model with different supervision in Figure 9. The model supervised by physics observation reaches the same level of performance as the model with texture supervision using far fewer training steps (500K vs. 2M). Supervised by both observations, our PPD model preserves the learning speed of the model with only physics supervision, and further improves its performance.
Preference over Possible Values. We illustrate the confusion matrices of physical parameter estimation in Figure 10. Although our PPD model performs similarly with only texture as input or with only physics as input, its preferences over all possible values turn out to be quite different. With texture as input (Figure 10(a)), it tends to guess within the possible values of the corresponding material (see Table 1), while with physics as input (Figure 10(b)), it only makes errors between very close values. Therefore, the information provided by the two types of inputs is orthogonal to each other (Figure 10(c)).
Impact of Primitive Numbers. As demonstrated in Table 5, the number of blocks has nearly no influence on the model with texture as input. With physics interactions as input, the model performs much better on fewer blocks, and its performance degrades when the number of blocks starts increasing. The degradation is probably because the physical response of any rigid body is fully characterized by a few object properties (i.e., total mass, center of mass, and moment of inertia), which provides us with limited constraints on the density distribution of an object when the number of primitives is relatively large.
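A small numeric example illustrates why more primitives leave the densities under-constrained. The sketch below assumes an idealized tower of perfectly aligned, identical cubes and uses made-up density values; the function name is hypothetical.

```python
def stack_inertial_properties(densities, side=1.0):
    """Total mass, center-of-mass height, and moment of inertia about a
    horizontal axis through the center of mass, for aligned stacked cubes."""
    vol = side ** 3
    masses = [rho * vol for rho in densities]
    heights = [(k + 0.5) * side for k in range(len(densities))]
    total = sum(masses)
    com = sum(m * z for m, z in zip(masses, heights)) / total
    # Parallel-axis theorem: a cube's own inertia (m * side^2 / 6) plus m * d^2.
    inertia = sum(m * side ** 2 / 6.0 + m * (z - com) ** 2
                  for m, z in zip(masses, heights))
    return total, com, inertia


uniform = stack_inertial_properties([4.0, 4.0, 4.0, 4.0])
mixed = stack_inertial_properties([5.0, 1.0, 7.0, 3.0])
```

Both four-block towers have the same total mass, center of mass, and moment of inertia, so under these idealized assumptions no push can tell them apart from the motion alone. With only two blocks, total mass and center of mass already determine both densities uniquely.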
|Observation|2 blocks|3 blocks|4 blocks|5 blocks|Overall|
|Texture + Physics|2.3|4.9|7.8|10.9|8.0|
Human Studies. We select the block towers with two blocks from our real dataset, and study the problem of "which block is heavier" on them. The human studies are conducted on Amazon Mechanical Turk. For each block tower, we provide 25 annotators with an image and a video of the physics interaction, and ask them to estimate the ratio of mass between the upper and the lower block. Instead of directly predicting a real value, we require the annotators to make a choice on a log scale from a predefined set of options. Results of the average human predictions, the model's predictions and the ground truth are shown in Figure 11. Our model performs comparably to humans, and its response is also highly correlated with humans: the Pearson's coefficients of "Human vs. Model", "Human vs. Truth" and "Model vs. Truth" are 0.69, 0.71 and 0.90, respectively.
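For completeness, the correlation statistic used above can be computed as in the following sketch. The function name is hypothetical and the numbers in the usage lines are made up for illustration; they are not our experimental data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up log mass ratios for three towers: annotator average vs. model output.
human = [-1.0, 0.0, 1.0]
model = [-2.0, 0.0, 2.0]
r = pearson(human, model)
```

Computing the coefficient on log ratios matches the log-scale choices given to annotators, so a heavier-on-top and a heavier-on-bottom error of the same factor are weighted equally.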
In this paper, we have formulated and studied the problem of physical primitive decomposition (PPD), which is to approximate an object with a set of primitives, explaining its geometry and physics. To this end, we proposed a novel formulation that takes both visual and physics observations as input. We evaluated our model on several different setups: synthetic block towers, synthetic tools and real-world objects. Our model achieved good performance on both synthetic and real data.
Acknowledgements: This work is supported by NSF #1231216, ONR MURI N00014-16-1-2007, Toyota Research Institute, and Facebook.
- [1] Agrawal, P., Nair, A., Abbeel, P., Malik, J., Levine, S.: Learning to poke by poking: Experiential learning of intuitive physics. In: NIPS (2016)
- [2] Attene, M., Falcidieno, B., Spagnuolo, M.: Hierarchical mesh segmentation based on fitting primitives. The Visual Computer 22(3), 181–193 (2006)
- [3] Battaglia, P.W., Hamrick, J.B., Tenenbaum, J.B.: Simulation as an engine of physical scene understanding. PNAS 110(45), 18327–18332 (2013)
- [4] Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: CVPR (2015)
- [5] Biederman, I.: Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94(2), 115 (1987)
- [6] Binford, T.O.: Visual perception by computer. In: IEEE Conf. on Systems and Control (1971)
- [7] Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Physics-based person tracking using the anthropomorphic walker. IJCV 87(1-2), 140 (2010)
- [8] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv:1512.03012 (2015)
- [9] Coumans, E.: Bullet physics engine. Open Source Software: http://bulletphysics.org (2010)
- [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
- [11] Grabner, H., Gall, J., Van Gool, L.: What makes a chair a chair? In: CVPR (2011)
- [12] Gupta, A., Efros, A.A., Hebert, M.: Blocks world revisited: Image understanding using qualitative geometry and mechanics. In: ECCV (2010)
- [13] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2015)
- [14] van den Hengel, A., Russell, C., Dick, A., Bastian, J., Pooley, D., Fleming, L., Agapito, L.: Part-based modelling of compound scenes from images. In: CVPR (2015)
- [15] Huang, H., Kalogerakis, E., Marlin, B.: Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. CGF 34(5), 25–38 (2015)
- [16] Huang, Q., Wang, H., Koltun, V.: Single-view reconstruction via joint analysis of image and shape collections. ACM TOG 34(4), 87 (2015)
- [17] Igarashi, T., Matsuoka, S., Tanaka, H.: Teddy: a sketching interface for 3d freeform design. In: SIGGRAPH (1999)
- [18] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
- [19] Gibson, J.J.: The theory of affordances. The Ecological Approach to Visual Perception 8 (1977)
- [20] Jia, Z., Gallagher, A., Saxena, A., Chen, T.: 3d reasoning from blocks to stability. IEEE TPAMI 37(5), 905–918 (2015)
- [21] Kalogerakis, E., Chaudhuri, S., Koller, D., Koltun, V.: A probabilistic model for component-based shape synthesis. ACM TOG 31(4), 55 (2012)
- [22] Kim, M., Pons-Moll, G., Pujades, S., Bang, S., Kim, J., Black, M.J., Lee, S.H.: Data-driven physics for human soft tissue animation. In: SIGGRAPH (2017)
- [23] Kim, V.G., Li, W., Mitra, N.J., Chaudhuri, S., DiVerdi, S., Funkhouser, T.: Learning part-based templates from large collections of 3d shapes. ACM TOG 32(4), 70 (2013)
- [24] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
- [25] Koppula, H.S., Saxena, A.: Physically grounded spatio-temporal object affordances. In: ECCV (2014)
- [26] Lerer, A., Gross, S., Fergus, R.: Learning physical intuition of block towers by example. In: ICML (2016)
- [27] Levenberg, K.: A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics 2(2), 164–168 (1944)
- [28] Li, J., Xu, K., Chaudhuri, S., Yumer, E., Zhang, H., Guibas, L.: Grass: Generative recursive autoencoders for shape structures. In: SIGGRAPH (2017)
- [29] Li, W., Leonardis, A., Fritz, M.: Visual stability prediction for robotic manipulation. In: ICRA (2017)
- [30] Li, Y., Wu, X., Chrysathou, Y., Sharf, A., Cohen-Or, D., Mitra, N.J.: Globfit: Consistently fitting primitives by discovering global relations. ACM TOG 30(4), 52 (2011)
- [31] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML (2013)
- [32] Marquardt, D.W.: An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics 11(2), 431–441 (1963)
- [33] Miller, A.T., Knoop, S., Christensen, H.I., Allen, P.K.: Automatic grasp planning using shape primitives. In: ICRA (2003)
- [34] Mottaghi, R., Rastegari, M., Gupta, A., Farhadi, A.: “what happens if…” learning to predict the effect of forces in images. In: ECCV (2016)
- [35] Nooruddin, F.S., Turk, G.: Simplification and repair of polygonal models using volumetric techniques. IEEE TVCG 9(2), 191–205 (2003)
- [36] Pham, T.H., Kheddar, A., Qammaz, A., Argyros, A.A.: Towards force sensing from vision: Observing hand-object interactions to infer manipulation forces. In: CVPR (2015)
- [37] Rivlin, E., Dickinson, S.J., Rosenfeld, A.: Recognition by functional parts. CVIU 62(2), 164–176 (1995)
- [38] Roberts, L.G.: Machine perception of three-dimensional solids. Ph.D. thesis, Massachusetts Institute of Technology (1963)
- [39] Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV 40(2), 99–121 (2000)
- [40] Savva, M., Chang, A.X., Hanrahan, P.: Semantically-enriched 3d models for common-sense knowledge. In: CVPR Workshop (2015)
- [41] Schnabel, R., Degener, P., Klein, R.: Completion and reconstruction with primitive shapes. CGF 28(2), 503–512 (2009)
- [42] Soo Park, H., Shi, J., et al.: Force from motion: decoding physical sensation in a first person video. In: CVPR (2016)
- [43] Tenenbaum, J.B.: Functional parts. In: CogSci (1994)
- [44] Tulsiani, S., Su, H., Guibas, L.J., Efros, A.A., Malik, J.: Learning shape abstractions by assembling volumetric primitives. In: CVPR (2017)
- [45] Van Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: ICML (2016)
- [46] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. MLJ 8(3-4), 229–256 (1992)
- [47] Wu, J., Lim, J.J., Zhang, H., Tenenbaum, J.B., Freeman, W.T.: Physics 101: Learning physical object properties from unlabeled videos. In: BMVC (2016)
- [48] Wu, J., Lu, E., Kohli, P., Freeman, W.T., Tenenbaum, J.B.: Learning to see physics via visual de-animation. In: NIPS (2017)
- [49] Wu, J., Yildirim, I., Lim, J.J., Freeman, W.T., Tenenbaum, J.B.: Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In: NIPS (2015)
- [50] Yao, B., Ma, J., Fei-Fei, L.: Discovering object functionality. In: ICCV (2013)
- [51] Yumer, M.E., Kara, L.B.: Co-abstraction of shape collections. ACM TOG 31(6), 166 (2012)
- [52] Zhao, Y., Zhu, S.C.: Scene parsing by integrating function, geometry and appearance models. In: CVPR (2013)
- [53] Zheng, D., Luo, V., Wu, J., Tenenbaum, J.B.: Unsupervised learning of latent physical properties using perception-prediction networks. In: UAI (2018)
- [54] Zheng, Y., Cohen-Or, D., Averkiou, M., Mitra, N.J.: Recurring part arrangements in shape collections. CGF 33(2), 115–124 (2014)
- [55] Zhu, Y., Zhao, Y., Zhu, S.C.: Understanding tools: Task-oriented object modeling, learning and recognition. In: CVPR (2015)
- [56] Zhu, Y., Fathi, A., Fei-Fei, L.: Reasoning about object affordances in a knowledge base representation. In: ECCV (2014)
- [57] Zou, C., Yumer, E., Yang, J., Ceylan, D., Hoiem, D.: 3d-prnn: Generating shape primitives with recurrent neural networks. In: ICCV (2017)
Appendix A.1 Implementation Details
We present some implementation details about network architecture and training.
3D ConvNet. As the building block of the voxel encoder, this network consists of five volumetric convolutional layers with kernel sizes 3×3×3 and padding sizes 1. Between convolutional layers, we add batch normalization [18], Leaky ReLU [31] with slope 0.2, and max-pooling of size 2×2×2. At the end of the network, we append two additional 1×1×1 volumetric convolutional layers.
Network Details. As the inputs fed into different encoders, voxels $V$, images $I$ and trajectories $T$ are of size 1×32×32×32, 3×224×224 and 256×7, respectively. The dimensions of output features from encoders, $f_V$, $f_I$ and $f_T$, are all 64. Inside both the trajectory encoder and the primitive generator, we employ the Long Short-Term Memory (LSTM) cell with hidden sizes of 64 and dropout rates of 0.5 as the recurrent unit. The trajectory encoder uses a single-layer recurrent neural network, while the primitive generator applies three layers of recurrently connected units.
We implement our PPD model in PyTorch (http://pytorch.org). For the image encoder, we make use of the weights of ResNet-18 [13] pre-trained on ImageNet [10] and replace its final classification layer with a fully-connected layer, while for other modules, we initialize their weights randomly. During optimization, we first train the geometric parameters (by setting the weight of the physics loss to 0), and then we train all parameters jointly. Optimization is carried out using ADAM [24] with a mini-batch size of 8.