Scene understanding requires, among other things, an understanding of the relations between and among objects. Many of these relations are governed by Newtonian laws, which rule out unlikely or even implausible configurations for the observer. They are part of the “dark matter” [30] in our everyday visual data that helps us interpret the configurations of objects correctly and accurately. Although objects simply obey these elementary laws of Newtonian mechanics, which can very well be captured in simulators, uncertainty in perception makes exploiting these relations challenging in artificial systems.
In contrast, humans understand such physical relations naturally, which, for example, enables them to manipulate and interact with novel objects in unseen conditions with ease. We build on a rich set of prior experiences that allow us to employ a type of commonsense understanding that most likely does not involve a symbolic representation of 3D geometry processed by a physics simulation engine. Rather, we seem to build on what has been termed “naïve physics” [28] or “intuitive physics” [20], which is a good enough proxy to let us operate successfully in the real world.
It has not yet been shown how to equip machines with a similar set of physics commonsense, thereby bypassing a model-based representation and a physical simulation. In fact, it has been argued that such an approach is unlikely to succeed due to, for example, the complexity of the problem. Only recently have several approaches revived this idea and reattempted a fully data-driven approach to capturing the essence of physical events via machine learning methods [21, 29, 10].
In contrast, studies in developmental psychology have shown that infants acquire knowledge of physical events, including support, collision and unveiling, by observation at a very early age. According to this research, the infant, equipped with some innate basic core knowledge [4], gradually builds its internal model of a physical event by observing its various outcomes. Amazingly, such basic knowledge of physical events, for example an understanding of the support phenomenon, can make its way into relatively complex operations as shown in Figure 1. Such structures are generated by stacking up an element or removing one while retaining the structure’s stability, relying primarily on effective knowledge of support events in such toy constructions. In our work, we focus on exactly this support event and construct a model for machines to predict object stability.
Hence, we revisit the classic setup of Tenenbaum and colleagues and explore to what extent machines can predict physical stability events directly from appearance cues. We approach this problem by synthetically generating a large set of wood block towers under a range of conditions, including varying numbers of blocks, varying block sizes, and planar vs. multi-layered configurations. We run those configurations through a simulator (only at training time!) in order to generate labels indicating whether the tower would fall. We show for the first time that the aforementioned stability test can be learned and predicted in a purely data-driven way, bypassing traditional model-based simulation approaches. In order to shed more light on the capabilities and limitations of our model, we accompany our experimental study with human judgments on the same stimuli.
2 Related Work
As humans, we possess the ability to judge from vision alone whether an object is physically stable and to predict objects’ physical behaviors. Yet it is unclear (1) how we make such decisions and (2) how we acquire this capability. Research in developmental psychology [2, 1, 3] suggests that infants acquire knowledge of physical events, including support events and others, at a very young age by observing those events. This partly answers question (2); however, there seems to be no consensus on the internal mechanism for interpreting external physical events that would address question (1). Battaglia et al. [5] proposed an intuitive physics simulation engine as such a mechanism and found that it resembles human subjects’ behavioral patterns in several psychological tasks. Historically, intuitive physics is connected to cases where people hold erroneous physical intuitions [20], such as the tendency to expect an object dropped from a moving subject to fall straight down. It is rather counter-intuitive how the simulation engine proposed in [5] can explain such erroneous intuitions.
While it is probably illusory to fully reveal humans’ inner mechanism for physical modeling and inference, it is feasible to build models based on observation, in particular visual information. In fact, historically, physical laws were discovered through the observation of physical events. Our work follows this direction. By observing a large number of support event instances in simulation, we want to gain deeper insight into the prediction paradigm.
In our work, we use a game engine to render scene images and a built-in physics simulator to simulate the scenes’ stability behavior. The data generation procedure is based on the platform used in [5]; however, as discussed before, that work hypothesized a simulation engine as an internal mechanism by which humans understand the physics of the external world, while we are interested in finding an image-based model that directly predicts physical behavior from the visual channel. Learning from synthetic data has a long tradition in computer vision and has recently gained increasing interest [18, 25, 22, 24] due to data-hungry deep learning approaches.
Understanding physical events also plays an important role in scene understanding in computer vision. By including additional cues from physical constraints, mostly from the support event, in the inference mechanism, prior work has improved results in the segmentation of surfaces [12] and scenes [26] from image data, and in object segmentation in 3D point cloud data [31].
Only very recently has learning physical concepts from data been attempted. Mottaghi et al. [21] aim at understanding dynamic events governed by the laws of Newtonian physics, but use prototypical motion scenarios as exemplars. Fragkiadaki et al. [10] analyze billiard table scenarios and aim at learning the dynamics from observation. While learning is largely data-driven, the object notion is predefined, as the locations of the balls are provided to the system. Wu et al. [29] aim to understand physical properties of objects. They again rely on an explicit physical simulation.
In contrast, we only use simulation at training time and predict for the first time visual stability directly from visual inputs of towers with a large number of degrees of freedom.
A recent paper [17] that appeared a few days before this work represents a related research thread that was conducted in parallel without our knowledge. The focus of that work is different from ours, namely predicting the outcome and falling trajectories for simple 4-block scenes. In our work, we significantly vary the scene parameters, investigate whether and how the prediction performance of the image-trained model changes with such variations, and further examine how human predictions adapt to the variation in the generated scenes and compare them to our model.
3 Towards Modeling a Visual Stability Test
In order to tackle a visual stability test, we require a data generation process that allows us to control the various degrees of freedom induced by the problem, as well as to generate large quantities of data in a repeatable setup. Therefore, we follow the pioneering work on this topic and use a simulator to set up and predict physical outcomes of wood block towers. Afterwards, we describe the method that we investigate for visual stability prediction. We employ state-of-the-art deep learning techniques, which are the de facto standard in today’s recognition systems. Lastly, we describe the setup of the human study that we conduct to complement the machine predictions with a human reference.
3.1 Synthetic Data
Based on the scene simulation framework used in [13, 5], we generate synthetic data for our experiments with rectangular cuboid blocks as basic elements. The number of blocks, the block size and the stacking depth are varied across scenes; we refer to these as scene parameters.
We expect that varying the size of the towers and of the blocks involved will influence the difficulty and challenge the competence of both humans and machines in eyeballing the stability of a tower. While the appearance evidently becomes more complex with an increasing number of blocks, the number of contact surfaces and interactions equally makes the problem richer. Therefore, we include scenes with four different numbers of blocks, namely 4, 6, 10 and 14 blocks, denoted 4B, 6B, 10B and 14B.
As we focus our investigations on judging stability from monocular input, we vary the depth of the tower from a one-layer setting, which we call 2D, to a multi-layer setting, which we call 3D. The first only allows a single block along the image plane at all height levels, while the other does not enforce such a constraint and can expand in the image plane. Visually, the former results in a single-layer stacking similar to Tetris, while the latter ends in a multiple-layer structure as shown in Figure 2. The latter most likely requires the observer to pick up on more subtle visual cues, as its layers are heavily occluded.
We include two groups of block size settings. In the first, the towers are constructed of blocks that all have the same size. The second introduces varying block sizes, where two of the three dimensions are randomly scaled according to a truncated Normal distribution whose parameters are small values. These two settings are referred to as Uni and NonUni. The setting with non-uniform blocks introduces small visual cues, where stability hinges on small gaps between differently sized blocks that are challenging even for human observers.
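The non-uniform block setting can be illustrated with a minimal sketch. The base size, the standard deviation `sigma`, the truncation `bound` and the random choice of which two dimensions to scale are all assumptions made for illustration; the values actually used in the experiments are not given here.

```python
import random


def sample_truncated_normal(mean, sigma, bound, rng):
    """Rejection-sample Normal(mean, sigma) restricted to [mean - bound, mean + bound]."""
    while True:
        x = rng.gauss(mean, sigma)
        if abs(x - mean) <= bound:
            return x


def non_uniform_block(base_size, sigma, bound, rng):
    """Scale two randomly chosen dimensions of base_size and keep the third unchanged.

    The scale factors are drawn around 1.0 (hypothetical parameterization).
    """
    dims = list(base_size)
    for axis in rng.sample(range(3), 2):
        dims[axis] *= sample_truncated_normal(1.0, sigma, bound, rng)
    return tuple(dims)
```

For example, `non_uniform_block((1.0, 1.0, 3.0), 0.05, 0.1, random.Random(0))` returns a block whose scaled dimensions stay within 10% of the hypothetical base size.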
Combining these three scene parameters, we define 16 different scene groups. For example, group 10B-2D-Uni contains scenes stacked with 10 blocks of the same size within a single layer. For each group, candidate scenes are generated, where each scene is constructed bottom-up under a non-overlapping geometrical constraint. For the prediction experiments, half of the images in each group are used for training and the other half for testing; the split is fixed across experiments.
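As a rough sketch of how the scene groups and the fixed split can be organized, the snippet below enumerates group names in the style of 10B-2D-Uni and produces a repeatable half/half split. The function names, the scene identifiers and the seeded shuffle are assumptions for illustration, not the original tooling.

```python
import itertools
import random

NUM_BLOCKS = ["4B", "6B", "10B", "14B"]
DEPTHS = ["2D", "3D"]
BLOCK_SIZES = ["Uni", "NonUni"]


def scene_groups():
    """Return all group names, e.g. '10B-2D-Uni'."""
    return ["-".join(p) for p in itertools.product(NUM_BLOCKS, DEPTHS, BLOCK_SIZES)]


def fixed_half_split(scene_ids, seed=0):
    """Split scene ids into two halves; the same seed always gives the same split."""
    ids = sorted(scene_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]
```

Seeding the shuffle is one simple way to keep the split identical across experiments.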
While we keep the rendering basic, we would like to point out that we deliberately decided against colored bricks, as used in prior work, in order to challenge perception and make identifying brick outlines and configurations harder. The lighting is fixed across scenes, and the camera is automatically adjusted so that the whole tower is centered in the captured image. Images are rendered in color.
We use the Bullet [8] physics engine in Panda3D [11] to perform physics-based simulation for each scene. Surface friction and gravity are enabled in the simulation. The system records the configuration of a scene of $N$ blocks at time $t$ as $(p_1, p_2, \dots, p_N)_t$, where $p_i$ is the location of block $i$. The stability is then automatically decided as a Boolean variable:

$$S = \bigwedge_{i=1}^{N} \Big( \Delta\big((p_i)_T - (p_i)_0\big) \le \tau \Big)$$

where $T$ is the end time of the simulation, $\Delta$ measures the displacement of a block between the starting point and the end time, $\tau$ is the displacement threshold, and $\bigwedge$ denotes the logical and operator. That is to say, a scene counts as unstable ($S = \text{False}$) if any block in the scene moved in simulation, and otherwise as stable ($S = \text{True}$).
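The labeling rule described above can be sketched in a few lines. Euclidean distance as the displacement measure, the function names and the threshold value are assumptions for illustration; the actual simulation is run in the physics engine and is not reproduced here.

```python
import math


def displacement(start, end):
    """Euclidean distance between a block's start and end location (assumed measure)."""
    return math.dist(start, end)


def is_stable(start_positions, end_positions, threshold):
    """A scene is stable only if every block moved no more than the threshold."""
    if len(start_positions) != len(end_positions):
        raise ValueError("need one end position per block")
    return all(
        displacement(p0, p_end) <= threshold
        for p0, p_end in zip(start_positions, end_positions)
    )
```

A single block exceeding the threshold is enough to mark the whole scene as unstable, which matches the "any block moved" reading of the rule.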
3.2 Stability Prediction from Still Images
Inspiration from Human
Research in [13, 5] suggests that combinations of the most salient features in the scenes are insufficient to capture people’s judgments. However, contemporary studies reveal that human perception of visual information, in particular of geometric features such as the critical angle [6, 7], plays an important role in the process. Regardless of the actual inner mechanism by which humans parse the visual input, it is clear that there is a mapping from visual input to the stability prediction.
Image Classifier for Stability Prediction
In our work, we are interested in a mapping that is exclusive to visual input and directly predicts physical stability. We use deep convolutional neural networks to learn this mapping, as they have shown great success on image classification tasks. Such networks have been shown to adapt to a wide range of classification and prediction tasks [23] through re-training or adaptation by fine-tuning. Therefore, these approaches seem an adequate method to study visual prediction on this challenging task, with the motivation that by changing conventional image class labels to stability labels, the network can learn “physical stability salient” features.
We experimented with three networks: LeNet [16], AlexNet [15] and VGG Net [27], an even larger network than AlexNet. We trained LeNet from scratch and fine-tuned the larger networks pre-trained on ImageNet [9]. VGG Net consistently outperforms the other two, hence we use it across our experiments. We use the Caffe framework [14] in all our experiments.
3.3 Human Subject Study
We recruit human subjects to predict stability for given scene images. Due to the large amount of test data, we sample images from the different scene groups for the human subject test. 8 subjects are recruited for the test. Each subject is presented with a set of captured images from the test split. Each set covers all scene groups, with multiple scene instances per group. For each scene image, the subject is required to rate the stability on a scale from 1 to 5, without any constraint on response time:
Definitely unstable: definitely at least one block will move/fall
Probably unstable: probably at least one block will move/fall
Cannot tell: the subject is not sure about the stability
Probably stable: probably no block will move/fall
Definitely stable: definitely no block will move/fall
The main questions that we aim at answering with our study are as follows:
How well do humans perform on the task?
This provides a basic reference for the task difficulty. High human performance may indicate that the task is too simple, while low performance may indicate that it is too difficult for the setup. In addition, human performance also serves as a baseline for our image-based model.
How does human performance vary with respect to scene variations?
This provides an additional reference for the task difficulty with respect to the scene parameters. If a certain scene parameter does not have much influence on human performance, it may be less relevant to human perception for the task. If human performance changes with the variation of a scene parameter, we can further distinguish between more dominant factors and less dominant ones. Moreover, it serves as a parallel baseline for comparison with the image-based model, to see if both are affected equally by the presented challenges.
How does human performance compare to the image-based model?
This answers whether the data-driven, image-based approach can match or possibly even outperform humans.
How do human vs. machine confidences relate?
The setup of our human study provides us with confidence values from 1 to 5 that reflect the certainty of the human judgment. We would like to investigate how these confidences are mimicked by our data-driven approach.
How do failure cases differ? Are they plausible?
In order to gain more insight into our model and to analyze to what extent a scene understanding of real physics or a certain human notion of commonsense physics was achieved, we compare failure cases and modes between human and machine prediction.
Since we want to compare the performance of the image-based model and humans, we use the same test data for both. More details are discussed in the following section.
4 Experiments
Our experimental analysis is composed of two main parts. The first studies whether and to what extent a largely model-free visual stability test can be learned directly from data alone. The second part puts these findings in relation to human judgments on the same stimuli.
4.1 Visual Stability Prediction
In this part of the experiments, images are captured before the physics engine is enabled, and the stability labels are recorded from the simulation engine as described before. At training time, the model has access to the images and the stability labels. At test time, the learned model’s stability predictions are evaluated against the results generated by the simulator.
We divide the experimental design into 3 sets: intra-group, cross-group and generalization. The first set investigates the influence of individual scene parameters on the model’s performance; the other two sets explore generalization properties under different settings.
4.1.1 Intra-Group Experiment
In this set of experiments, we train and test on the scenes with the same scene parameters in order to assess the feasibility of our task.
Number of Blocks
In this group of experiments, we fix the stacking depth and keep all blocks the same size, but vary the number of blocks in the scene to observe how it affects the prediction rates of the image-trained model, which approximates the relative recognition difficulty due to this scene parameter alone. The results are shown in Table 1. A consistent drop in performance can be observed with an increasing number of blocks under various block size and stacking depth conditions. More blocks in the scene generally lead to taller scene structures and hence greater difficulty in perception.
Block Size
In this group of experiments, we explore how uniform and varied block sizes affect the prediction rates of the image-trained model. We compare the results for different numbers of blocks with those of the previous group. In the most obvious case, scenes that happen to have similar stacking patterns and the same number of blocks can still differ in visual appearance. To further eliminate the influence of stacking depth, we fix all scenes in this group to 2D stacking only. As can be seen from Table 1, the performance decreases when moving from uniform to non-uniform block sizes. The additional variety introduced by the block size indeed makes the task more challenging.
Stacking Depth
In this group of experiments, we investigate how stacking depth affects the prediction rates. Increasing stacking depth naturally introduces ambiguity in the perception of the scene structure, as some parts of the scene can be fully or partially occluded by other parts. As in the previous groups, we minimize the influence of the other scene parameters by fixing the block size and only observing the performance across different numbers of blocks. The results in Table 1 show somewhat inconsistent behavior between relatively simple scenes (4 and 6 blocks) and difficult scenes (10 and 14 blocks). For simple scenes, prediction accuracy increases when moving from 2D stacking to 3D, while it is the other way around for complex scenes.
Relaxing the constraint on stacking depth naturally introduces an additional challenge for the perception of depth information; yet, given a fixed number of blocks in the scene, the change also tends to make the scene structure lower, which reduces the difficulty in perception. The combination of these two factors decides the final difficulty of the task. For simple scenes, the height factor has the stronger influence, and hence prediction accuracy is better for 3D than for 2D stacking. For complex scenes, the stacking depth dominates, as the significantly higher number of blocks can retain a reasonable height of the structure, and hence performance decreases when moving from 2D stacking to 3D.
4.1.2 Cross-Group Experiment
In this set of experiments, we want to see how the learned model transfers across scenes of different complexity. We therefore divide the scene groups into two large groups by the number of blocks: a simple scene group containing all scenes with 4 and 6 blocks, and a complex scene group containing the remaining scenes with 10 and 14 blocks. We investigate classification in both directions, shown in Figure 6, namely:
Train on simple scenes and predict on complex scenes: Train on 4 and 6 blocks and test on 10 and 14 blocks
Train on complex scenes and predict on simple scenes: Train on 10 and 14 blocks and test on 4 and 6 blocks
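The two transfer directions listed above can be sketched as a simple partition of group names by block count. The group naming follows the 10B-2D-Uni pattern; the helper names are hypothetical.

```python
SIMPLE_COUNTS = {4, 6}
COMPLEX_COUNTS = {10, 14}


def num_blocks(group_name):
    """Parse the block count from a group name such as '10B-2D-Uni'."""
    return int(group_name.split("-")[0].rstrip("B"))


def cross_group_split(groups, train_on="simple"):
    """Return (train_groups, test_groups) for one transfer direction."""
    if train_on == "simple":
        train_counts, test_counts = SIMPLE_COUNTS, COMPLEX_COUNTS
    elif train_on == "complex":
        train_counts, test_counts = COMPLEX_COUNTS, SIMPLE_COUNTS
    else:
        raise ValueError("train_on must be 'simple' or 'complex'")
    train = [g for g in groups if num_blocks(g) in train_counts]
    test = [g for g in groups if num_blocks(g) in test_counts]
    return train, test
```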
The results are shown in Table 2. When trained on simple scenes and predicting on complex scenes, the model performs significantly better than random guessing. This is understandable, as the learned visual features can transfer across different scenes. Further, we observe a significant performance boost when training on complex scenes and testing on simple scenes. This can be explained by the richer features learned from the complex scenes, which generalize better.
Table 2: Results of the cross-group experiment, with columns Setting, Simple → Complex and Complex → Simple.
4.1.3 Generalization Experiment
In this set of experiments, we explore whether we can train a general model to predict stability for scenes with any scene parameters, which is very similar to how humans approach the task. We use training images from all scene groups and test on each group. The results are shown in Table 3. While the performance exhibits a trend similar to that of the intra-group experiment with respect to the complexity of the scenes, namely a higher recognition rate for simpler settings and a lower rate for more complex settings, there is a consistent improvement over the intra-group experiment for individual groups. Together with the results of the cross-group experiment, this suggests a strong generalization capability of the image-trained model.
Overall, we can conclude that direct stability prediction is possible and in fact fairly accurate for moderate difficulty levels. As expected, the 3D setting makes prediction from appearance more difficult, due to significant occlusion for towers of more than 10 blocks. Surprisingly, little effect was observed for small towers when switching from uniform to non-uniform blocks, although the appearance difference can be quite small.
4.2 Human Judgment on Synthetic Data
For the human subject test, the predictions are binarized: “definitely unstable” and “probably unstable” are treated as unstable predictions, and “probably stable” and “definitely stable” as stable predictions, regardless of the certainty qualifiers. The results are shown in Table 4 and Figure 11.
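The binarization can be sketched as follows. The numeric coding (1 for “definitely unstable” through 5 for “definitely stable”) and the choice to count “cannot tell” as an incorrect answer are assumptions made for this illustration; the text does not specify either.

```python
def binarize_rating(rating):
    """Map a 1-5 rating to 'unstable', 'stable', or None for 'cannot tell'.

    Assumed coding: 1-2 are unstable ratings, 3 is 'cannot tell', 4-5 are stable.
    """
    if rating in (1, 2):
        return "unstable"
    if rating in (4, 5):
        return "stable"
    if rating == 3:
        return None
    raise ValueError("rating must be an integer from 1 to 5")


def human_accuracy(ratings, labels):
    """Fraction of ratings whose binarized value matches the simulator label.

    'Cannot tell' never matches a label, so it counts as incorrect (assumption).
    """
    if not ratings or len(ratings) != len(labels):
        raise ValueError("need equally many ratings and labels, at least one")
    correct = sum(1 for r, y in zip(ratings, labels) if binarize_rating(r) == y)
    return correct / len(ratings)
```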
4.2.1 How well do humans perform on this task?
For very simple scenes with few blocks, humans can reach close to perfect performance, while for complex scenes the performance drops significantly.
4.2.2 How does human performance vary with respect to scene variations?
As discussed before, the number of blocks indicates the scene’s complexity. Given the same block size and stacking depth condition, human prediction generally degrades with an increasing number of blocks in the scene. Given the same block size condition and number of blocks, human predictions for 3D stacking are better than their counterparts for 2D. This can partially be explained by the fact that the structure is more likely to be lower than in a scene with the stacking constraint, and the decreased height in turn reduces the scene’s complexity for human judgment. Varied block sizes are consistently more difficult than fixed block sizes: in most cases, when a scene group changes the block condition from “Uni” to “NonUni”, the performance decreases.
4.2.3 How does human performance compare to image-based model?
Compared to human prediction on the same part of the test data, the image-based model outperforms humans in most scene groups. While showing similar trends in performance with respect to the different scene parameters, the image-based model is less affected by more difficult scene parameter settings. For example, given the same block size and stacking depth condition, its prediction accuracy decreases more slowly than its counterpart in human prediction. We interpret this as the image-based model possessing better generalization capability than humans on this task.
To gain further insight into the results, we plot the average accuracy against each scene parameter alone. The results are shown in Figure 7. The accuracy of both humans and the image-based model decreases consistently with the number of blocks in the scene. For stacking depth and block size, however, humans and the image-based model exhibit different trends: while the image-based model always outperforms the humans, human performance catches up in more complex scene parameter settings.
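Averaging accuracy over a single scene parameter amounts to grouping the per-group accuracies by one component of the group name. The sketch below assumes group names of the form 10B-2D-Uni and a hypothetical dictionary of per-group accuracies; any numbers passed in are placeholders, not results from the experiments.

```python
from collections import defaultdict

PARAM_INDEX = {"num_blocks": 0, "depth": 1, "block_size": 2}


def average_by_parameter(group_accuracy, parameter):
    """Average per-group accuracies over all groups sharing one parameter value."""
    idx = PARAM_INDEX[parameter]
    buckets = defaultdict(list)
    for group, acc in group_accuracy.items():
        buckets[group.split("-")[idx]].append(acc)
    return {value: sum(accs) / len(accs) for value, accs in buckets.items()}
```

Calling the function once per parameter yields the three marginal curves that such a plot would show.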
4.2.4 How do failure cases differ? Are they plausible?
In our test, humans are prone to make mistakes for scenes of significant height, while the machine is less affected by this factor. This is consistent with the observation in prior work that height plays an important role in human judgment of stability. In contrast, the machine is trained across different heights and hence can adapt to more variation. The top row of Figure 8 shows some examples of such scenes. On the other hand, the machine makes more mistakes than humans when the scenes are constructed in multiple layers. This is understandable, as our model is only trained on monocular images while humans have prior knowledge for the perception of depth information. Examples are shown in the bottom row of Figure 8. Figure 10 provides further examples of unstable scenes falsely predicted as stable by both human and machine. While occlusion can affect both human and machine judgment, it affects the machine more than humans, which is consistent with the false unstable predictions. Similarly, height affects humans more than the machine, as in the false unstable predictions.
Further, we compute histograms of the human ratings and of the prediction confidence of the image-based model (for visualization purposes, we quantize the prediction confidence into 5 bins). The result is shown in Figure 9. It is interesting to see that the two distributions are relatively similar.
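Quantizing a continuous confidence into 5 bins, so that it can be compared with the 5-level human ratings, might look like the sketch below. It assumes the confidence is a probability of the stable class in [0, 1] and uses equal-width bins; both are assumptions, since the exact binning is not specified.

```python
def quantize_confidence(p_stable, num_bins=5):
    """Map a probability in [0, 1] to a bin index in 1..num_bins (equal-width bins)."""
    if not 0.0 <= p_stable <= 1.0:
        raise ValueError("p_stable must lie in [0, 1]")
    # min() keeps p_stable == 1.0 inside the last bin.
    return min(int(p_stable * num_bins), num_bins - 1) + 1


def histogram(bin_indices, num_bins=5):
    """Count how many values fall into each bin (bins are 1-indexed)."""
    counts = [0] * num_bins
    for b in bin_indices:
        counts[b - 1] += 1
    return counts
```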
Correlation between human performance and machine
Unlike [5], our work does not aim to reconstruct humans’ inner mechanism; hence the correlation between human predictions and the model’s predictions is not our priority. Yet we list such statistics to provide the reader with a more comprehensive picture of the results. Here we show scatter plots of paired human and machine predictions by different scene parameters, namely number of blocks (Figure 12), stacking depth (Figure 13) and block size (Figure 14). We computed the Pearson correlation coefficient for each group. The detailed values are shown in Tables 5, 6 and 7. Interestingly, human prediction and machine prediction are moderately positively correlated.
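For reference, the Pearson correlation coefficient between paired human and machine predictions can be computed as in the minimal sketch below. The function name and the input format (two equally long lists of numbers) are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

The coefficient ranges from -1 to 1, with positive values indicating that the two sets of predictions tend to move together.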
5 Conclusion
In this work, we address the question of whether and how well we can build a mechanism to predict physical stability directly from visual input. In contrast to existing approaches, we bypass explicit 3D representation and physical simulation and learn a model for visual stability prediction from data. We evaluate our model under a range of conditions, including variations in the number of blocks, the size of blocks and the 3D structure of the overall tower. The results reflect the challenges of increasingly complex inference with increasing size of the structure, as well as challenges due to the small features that stability hinges on, owing to occlusions or block size variations.
Based on these encouraging results, we envision systems that exploit such data-driven notions of physics to arrive at advanced methods for scene understanding that reason about physically plausible states during visual inference. We will also investigate richer output spaces than binary labels, which would shed more light on the quality of physical understanding acquired by the learning-based approach.
References
- [1] Baillargeon, R.: A model of physical reasoning in infancy. Advances in Infancy Research (1995)
- [2] Baillargeon, R.: How do infants learn about the physical world? Current Directions in Psychological Science (1994)
- [3] Baillargeon, R.: The acquisition of physical knowledge in infancy: A summary in eight lessons. Blackwell Handbook of Childhood Cognitive Development (2002)
- [4] Baillargeon, R.: Innate ideas revisited: For a principle of persistence in infants’ physical reasoning. Perspectives on Psychological Science (2008)
- [5] Battaglia, P.W., Hamrick, J.B., Tenenbaum, J.B.: Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences (2013)
- [6] Cholewiak, S.A., Fleming, R.W., Singh, M.: Visual perception of the physical stability of asymmetric three-dimensional objects. Journal of Vision (2013)
- [7] Cholewiak, S.A., Fleming, R.W., Singh, M.: Perception of physical stability and center of mass of 3-d objects. Journal of Vision (2015)
- [8] Coumans, E.: Bullet physics engine. Open Source Software: http://bulletphysics.org (2010)
- [9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
- [10] Fragkiadaki, K., Agrawal, P., Levine, S., Malik, J.: Learning visual predictive models of physics for playing billiards. arXiv preprint arXiv:1511.07404 (2015)
- [11] Goslin, M., Mine, M.R.: The panda3d graphics engine. Computer (2004)
- [12] Gupta, A., Efros, A.A., Hebert, M.: Blocks world revisited: Image understanding using qualitative geometry and mechanics. In: ECCV (2010)
- [13] Hamrick, J., Battaglia, P., Tenenbaum, J.B.: Internal physics models guide probabilistic judgments about object dynamics. In: Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Cognitive Science Society, Austin, TX (2011)
- [14] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia. ACM (2014)
- [15] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
- [16] LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker, H., Guyon, I., Muller, U., Sackinger, E., et al.: Comparison of learning algorithms for handwritten digit recognition. In: International Conference on Artificial Neural Networks (1995)
- [17] Lerer, A., Gross, S., Fergus, R.: Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312 (2016)
- [18] Li, W., Fritz, M.: Recognizing materials from virtual examples. In: ECCV (2012)
- [19] MacDougal, D.W.: Galileo’s great discovery: How things fall. In: Newton’s Gravity. Springer (2012)
- [20] McCloskey, M.: Intuitive physics. Scientific American (1983)
- [21] Mottaghi, R., Bagherinezhad, H., Rastegari, M., Farhadi, A.: Newtonian image understanding: Unfolding the dynamics of objects in static images. arXiv preprint arXiv:1511.04048 (2015)
- [22] Peng, X., Sun, B., Ali, K., Saenko, K.: Learning deep object detectors from 3d models. In: ICCV (2015)
- [23] Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-shelf: an astounding baseline for recognition. In: CVPR Workshops (2014)
- [24] Rematas, K., Ritschel, T., Fritz, M., Gavves, E., Tuytelaars, T.: Deep reflectance maps. In: CVPR (2016)
- [25] Rematas, K., Ritschel, T., Fritz, M., Tuytelaars, T.: Image-based synthesis and re-synthesis of viewpoints guided by 3d models. In: CVPR (2014)
- [26] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)
- [27] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
- [28] Smith, B., Casati, R.: Naive Physics: An Essay in Ontology. Philosophical Psychology (1994)
- [29] Wu, J., Yildirim, I., Lim, J.J., Freeman, B., Tenenbaum, J.: Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In: NIPS (2015)
- [30] Xie, D., Todorovic, S., Zhu, S.C.: Inferring “dark matter” and “dark energy” from videos. In: ICCV (2013)
- [31] Zheng, B., Zhao, Y., Yu, J., Ikeuchi, K., Zhu, S.C.: Beyond point clouds: Scene understanding by reasoning geometry and physics. In: CVPR (2013)